Please help me optimize this bash script; it takes too long to execute.
Requirements:
The log file I am working with has some rows that start with a date and some rows that do not.
I need to prepend the date from the preceding row whenever a row does not start with one.
I work in MinGW64 under Windows 10.
The date is in this format: 2022-06-09 17:47:08,371
Given file (schematically; dateN stands for a datetime):
date1 string1
string2 date (just a date inside the log text, not a date at the beginning of the row)
, string3
date2 string4
string5
]string6
date3 string7
date4 string8
date5 string9
Example of given file:
2022-06-09 10:00:01,000 string1
string2 2022-06-09 10:00:01,000 string2 2022 string2
, string3 string3 string3
2022-06-09 10:00:02,000 string4
string5
]string6 string6 string6
}
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
Expected file:
date1 string1
date1 string2 date
date1 , string3
date2 string4
date2 string5
date2 ]string6
date3 string7
date4 string8
date5 string9
Example of expected file:
2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 , string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string5
2022-06-09 10:00:02,000 ]string6 string6 string6
2022-06-09 10:00:02,000 }
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
My script, which needs optimization. I tried the following; I did it with a loop, and it is very slow: every iteration re-reads and rewrites the whole file with sed, so the runtime grows quadratically with the number of lines.
nn_lines_to_replace=$(grep -Evn "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}" "$file" | cut -d ":" -f1)
for nn_line in $nn_lines_to_replace ; do
    # take the first two fields (date and time) of the previous line
    replace=$(sed -n "$((nn_line-1))p" "$file" | cut -d " " -f1-2)
    # prepend them to the current line (rewrites the whole file every time)
    sed -i "${nn_line}s/^/$replace /" "$file"
done
Maybe it could be done with sed or awk alone.
If you have ideas on how to optimize it, or a better approach, please share; I will really appreciate any help.
Update: I have complicated the conditions of this issue: link
CodePudding user response:
I gave the following solution in a comment, and it is working for the OP:
Your solution should loop through the file only once. For each line, check for a leading date: if there is one, store it in a variable; otherwise, write out the line with the last stored date prepended. You can use
awk '
/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/ {last_d_t=$1 " " $2; print; next}
{print last_d_t, $0}
' inputfile
CodePudding user response:
I would harness GNU AWK for this task in the following way. Let file.txt contain
2022-06-09 10:00:01,000 string1
string2 2022-06-09 10:00:01,000 string2 2022 string2
, string3 string3 string3
2022-06-09 10:00:02,000 string4
string5
]string6 string6 string6
}
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
then
awk '/^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]/{d=substr($0, 1, 24);print;next}{print d $0}' file.txt
output
2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 , string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string5
2022-06-09 10:00:02,000 ]string6 string6 string6
2022-06-09 10:00:02,000 }
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
Explanation: for lines starting with 20 followed by two digits, -, two digits, -, and two digits, I extract the first 24 characters (the 23-character datetime plus the trailing space) using the substr function, store them in variable d, print the line as is, and instruct GNU AWK to go to the next line. The second action therefore applies only to lines that do not match that regular expression: they are printed with the value of d prepended. Disclaimer: this solution assumes you are working with dates in the 2000...2099 range, that any line starting with the described pattern contains a datetime string of fixed width, and that every line without a datetime string is preceded somewhere above by a line with one.
(tested in gawk 4.2.1)
CodePudding user response:
This answer checks for the datetime with a partial regex, i.e. it verifies only the high-level shape of the string (character runs and separators at fixed positions) instead of validating every digit. The original was an obfuscated {m,g}awk one-liner with a concatenation bug that glued the remembered prefix straight onto the following text, producing lines like 2022-06-09 10:00:01,000 string1string2. De-obfuscated and fixed, the idea is:
awk 'match($0, /^....-..-.. ..:..:..,... /) { d = substr($0, 1, RLENGTH); print; next }
     { print d $0 }' file.txt
RLENGTH is 24 here: the 23-character datetime plus the trailing space, so the prefix carries its own separator and the strings no longer clump together. Output:
2022-06-09 10:00:01,000 string1
2022-06-09 10:00:01,000 string2 2022-06-09 10:00:01,000 string2 2022 string2
2022-06-09 10:00:01,000 , string3 string3 string3
2022-06-09 10:00:02,000 string4
2022-06-09 10:00:02,000 string5
2022-06-09 10:00:02,000 ]string6 string6 string6
2022-06-09 10:00:02,000 }
2022-06-09 10:00:03,000 string7 string7
2022-06-09 10:00:04,000 string8 string8
2022-06-09 10:00:05,000 string9
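The question also asks whether sed alone could do it. Here is a hold-space sketch (assumes GNU sed, which MinGW64 provides; like the answers above, it expects every undated line to be preceded by a dated one):

```shell
# Keep the last seen 23-character datetime in sed's hold space and
# prepend it, plus a space, to every line that does not start with one.
sed -E '
  # dated line: stash its datetime in the hold space, print the line as is
  /^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}/ {
    h; s/^(.{23}).*/\1/; x; b
  }
  # undated line: append the held datetime, then move it to the front
  G
  s/^(.*)\n(.*)$/\2 \1/
' file.txt
```

This is also a single pass over the file, so it avoids the quadratic cost of the original loop.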