Bash script to insert date from upper row if date is absent at start of row


Please help me optimize this bash script; it takes too much time to execute.

Requirements: the log file I am working with has some rows that start with a date and some rows that do not.
I need to prepend the date from the nearest preceding dated row to any row that does not start with a date.
I work in MinGW64 under Windows 10.
The date is in the format: 2022-06-09 17:47:08,371

Given file:
date1 string1
string2 date (just a date somewhere inside the log line, not a date at the beginning of the row)
, string3
date2 string4
string5
]string6
date3 string7
date4 string8
date5 string9

Example of given file:

2022-06-09 10:00:01,000 string1
string2 2022-06-09 10:00:01,000 string2 2022 string2
, string3 string3 string3
2022-06-09 10:00:02,000 string4
string5
]string6 string6 string6
}
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9

Expected file:
date1 string1
date1 string2 date
date1 , string3
date2 string4
date2 string5
date2 ]string6
date3 string7
date4 string8
date5 string9

Example of expected file:

2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 , string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string5
2022-06-09 10:00:02,000 ]string6 string6 string6
2022-06-09 10:00:02,000 }
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9

My script, which needs optimization.
I tried the following:
I did it with a loop, and it is very slow.

# numbers of all lines that do NOT start with a datetime
nn_lines_to_replace=$(grep -Evn "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}" "$file" | cut -d ":" -f1)
for nn_line in $nn_lines_to_replace ; do
  # take the datetime (first two space-separated fields) from the line above
  replace=$(sed -n "$((nn_line-1))p" "$file" | cut -d " " -f1-2)
  # prepend it, plus a separating space, to the current line in place
  sed -i "$nn_line s/^/$replace /" "$file"
done

Maybe it could be done with sed or awk.
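For the record, it can indeed be done in a single pass with sed alone, using the hold space. A sketch (my addition, assuming GNU sed and that every undated line is preceded by some dated line):

```shell
# Keep each dated line in the hold space; for undated lines, append the held
# line with G and keep only its leading datetime via the s command.
sed -E '
  /^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/ {
    h
    b
  }
  G
  s/^(.*)\n([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}).*/\2 \1/
' "$file"
```

The b command branches to the end of the script, so a dated line is auto-printed unchanged and the substitution only runs on undated lines.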

If you have ideas on how to optimize it, or a better approach, please share. I will really appreciate any help.

Update: I have made the condition of this issue more complicated: link

CodePudding user response:

I gave the following solution in a comment, and it is working for the OP:

Your solution should loop through the file only once. For each line, check for a date: if there is one, store it in a variable; if not, prepend the last stored date. You can use

awk '
  /^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/ {last_d_t=$1 " " $2; print; next}
  {print last_d_t " " $0}
' inputfile

CodePudding user response:

I would harness GNU AWK for this task in the following way. Let file.txt content be

2022-06-09 10:00:01,000 string1
string2 2022-06-09 10:00:01,000 string2 2022 string2
, string3 string3 string3
2022-06-09 10:00:02,000 string4
string5
]string6 string6 string6
}
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9

then

awk '/^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]/{d=substr($0, 1, 24);print;next}{print d $0}' file.txt

output

2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 , string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string5
2022-06-09 10:00:02,000 ]string6 string6 string6
2022-06-09 10:00:02,000 }
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9

Explanation: for lines starting with 20, followed by 2 digits, -, 2 digits, -, and 2 more digits, I extract the first 24 characters using the substr function, store them in variable d, print the line as is, and instruct GNU AWK to go to the next line. The second action is therefore applied to all lines which do not match said regular expression: for them, I print the line prepended with the value of variable d.

Disclaimer: this solution assumes you are working with dates in the 2000...2099 range, that any line starting with the described regular expression contains a datetime string of fixed width, and that for each line without a datetime string there exists an earlier line with one.
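As a quick sanity check on the width used in substr($0, 1, 24): the datetime stamp plus its trailing space is exactly 24 bytes.

```shell
# 23 characters of "YYYY-MM-DD HH:MM:SS,mmm" plus one trailing space = 24
printf '2022-06-09 10:00:01,000 ' | wc -c
```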

(tested in gawk 4.2.1)

CodePudding user response:

This one uses a partial regex to check for the datetime

  • i.e. it validates the high-level digits/separators formatting,

    -- instead of verifying it digit by digit
    {m,g}awk '
    BEGIN {
     1          _="-- ::," (__="[09]")
     1         gsub("[[, :-]",(__)(__) "&",_)
     1          sub("^", (__) (__), _)
 1         gsub( !_,      "&-", _)
    
     1   ___=length(FS="^"(_)" ")*(_^=__="")
    }
    $! NF = sprintf("%.*s%s%.*s", (_!=(NF-_) * ___, __, $!_, _< _,
                                   _ ~ NF ?"":__=substr($!_,_,___))'
    

output

2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string1string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 string1, string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string4string5
2022-06-09 10:00:02,000 string4]string6 string6 string6
2022-06-09 10:00:02,000 string4}
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9

The strings are strangely clumping together, though. Oh well, c'est la vie.
