I have time-series data in which measurement values from different sensors have been captured asynchronously in the same ASCII file. The values are whitespace-separated.
The original file looks like this:
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE
2022-04-03 21:47:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:48:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:49:30 NOTSAMPLED NOTSAMPLED CLOSE
2022-04-03 21:50:30 10.19 NOTSAMPLED NOTSAMPLED
2022-04-03 21:51:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:52:30 NOTSAMPLED NOTSAMPLED OPEN
2022-04-03 21:53:30 10.19 NOTSAMPLED NOTSAMPLED
Now, apart from rows where a sensor simply has no earlier reading, I need to replace the string "NOTSAMPLED" with that sensor's most recent previous value, like below:
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
Can this be achieved using sed/awk or other bash shell scripting commands?
CodePudding user response:
Here's an awk solution that fills down all the NOTSAMPLED fields (starting from field #3):
awk '
NR == 1 { split($0, filldown) }
{
    for (i = 3; i <= NF; i++)
        if ($i != "NOTSAMPLED")
            filldown[i] = $i
        else
            $i = filldown[i]
    print
}
' file.txt
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
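Note that awk does not modify the file in place; a common pattern is to write to a temporary file and move it back. A minimal sketch (the here-doc just recreates a small `file.txt` so it runs standalone; with GNU awk 4.1+ you could instead use `gawk -i inplace`):

```shell
# Recreate a small sample file.txt (first three data rows from the question).
cat > file.txt <<'EOF'
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE
EOF

# Fill down fields 3..NF, then replace the original file with the result.
awk '
NR == 1 { split($0, filldown) }
{
    for (i = 3; i <= NF; i++)
        if ($i != "NOTSAMPLED")
            filldown[i] = $i
        else
            $i = filldown[i]
    print
}
' file.txt > file.txt.tmp && mv file.txt.tmp file.txt
```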
Here's the bash version implementing the same logic. Text processing in bash is slow and isn't considered good practice, but you might find it easier to follow if you're not familiar with awk:
#!/bin/bash
{
    read -ra filldown
    while read -ra fields
    do
        for ((i = 2; i < ${#fields[@]}; i++))
        do
            if [[ ${fields[i]} != NOTSAMPLED ]]
            then
                filldown[i]=${fields[i]}
            else
                fields[i]=${filldown[i]}
            fi
        done
        printf '%s\n' "${fields[*]}"
    done
} < file.txt
CodePudding user response:
First off: this is not an efficient answer, but it gets the job done and shows what is going on.
The file test.txt contains the example input.
#!/bin/bash
#set -x
# first set the variables for the first run
oldfield1="NOTSAMPLED"
oldfield2="NOTSAMPLED"
oldfield3="NOTSAMPLED"
oldfield4="NOTSAMPLED"
oldfield5="NOTSAMPLED"
NOTSAMPLED="NOTSAMPLED"
while read -r line; do
    field1=$(echo "${line}" | cut -d ' ' -f 1)
    field2=$(echo "${line}" | cut -d ' ' -f 2)
    field3=$(echo "${line}" | cut -d ' ' -f 3)
    field4=$(echo "${line}" | cut -d ' ' -f 4)
    field5=$(echo "${line}" | cut -d ' ' -f 5)
    [[ ${field1} == "${NOTSAMPLED}" ]] && field1=${oldfield1}
    [[ ${field2} == "${NOTSAMPLED}" ]] && field2=${oldfield2}
    [[ ${field3} == "${NOTSAMPLED}" ]] && field3=${oldfield3}
    [[ ${field4} == "${NOTSAMPLED}" ]] && field4=${oldfield4}
    [[ ${field5} == "${NOTSAMPLED}" ]] && field5=${oldfield5}
    echo "${field1} ${field2} ${field3} ${field4} ${field5}"
    oldfield1="${field1}"
    oldfield2="${field2}"
    oldfield3="${field3}"
    oldfield4="${field4}"
    oldfield5="${field5}"
done <test.txt
Output:
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
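Most of the cost above comes from spawning five `echo | cut` pipelines per input line. A sketch of the same loop that lets `read` split each line into the five fields itself (assuming exactly five space-separated columns, as in the sample, and that only fields 3-5 can be NOTSAMPLED; the here-doc recreates a small test.txt so it runs standalone):

```shell
#!/bin/bash
# Recreate a small test.txt (first three data rows from the question).
cat > test.txt <<'EOF'
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE
EOF

oldfield3="NOTSAMPLED"; oldfield4="NOTSAMPLED"; oldfield5="NOTSAMPLED"
# read splits the line on IFS, so no cut subshells are needed.
while read -r field1 field2 field3 field4 field5; do
    [[ ${field3} == NOTSAMPLED ]] && field3=${oldfield3}
    [[ ${field4} == NOTSAMPLED ]] && field4=${oldfield4}
    [[ ${field5} == NOTSAMPLED ]] && field5=${oldfield5}
    echo "${field1} ${field2} ${field3} ${field4} ${field5}"
    oldfield3="${field3}"; oldfield4="${field4}"; oldfield5="${field5}"
done < test.txt > filled.txt
```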
CodePudding user response:
UPDATE 1: benchmarking results using a `2.66 GB, 60.4 million row` synthetic version of the sample data:
in0: 2.66GiB 0:00:36 [75.0MiB/s] [75.0MiB/s] [===..====>] 100%
out9: 2.01GiB 0:00:36 [56.5MiB/s] [56.5MiB/s] [ <=> ]
60,406,830 lines 2054.705 MB (2154514147) /dev/stdin
% pvE0 < sample3.txt | mawk2 '
BEGIN { ____["NOTSAMPLED"]
        OFS = sprintf("%c", (___= ++_ + (++_))^--___)
} {
    if (NR < ___) {
        NF = split($!_, __)
    } else { _ = NF
        do { __[_] = $_ = ($_ in ____) ? \
             __[_] : $_ } while (___ < --_) } }_' | pvE9 | wc4
Input throughput rate of 75.0 MiB/s, or ~1.66 million rows/sec.
=================================================
< sample2.txt gawk -e '
BEGIN { ____["NOTSAMPLED"]
        OFS = sprintf("%c", (___= ++_ + (++_))^--___)
} {
    if (NR < ___) {
        NF = split($!_, __)
    } else { _ = NF
        do { __[_] = $_ = ($_ in ____) ? \
             __[_] : $_ } while (___ < --_) } }_'
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
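The heavily obfuscated variable names make the program hard to follow; here is a readable sketch of the same fill-down logic (written with awk's default space output separator rather than the computed OFS, the loop running forwards, and the sample file name `sample.txt` assumed; the here-doc just recreates it so the sketch runs standalone):

```shell
# Recreate a small sample.txt (first three data rows from the question).
cat > sample.txt <<'EOF'
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE
EOF

awk '
NR == 1 { split($0, prev) }      # remember the first line as the seed values
NR > 1 {
    for (i = 3; i <= NF; i++)
        if ($i == "NOTSAMPLED")
            $i = prev[i]         # fill down from the last seen value
        else
            prev[i] = $i         # record the newly sampled value
}
{ print }
' sample.txt > out.txt
```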
Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk.
-- The 4Chan Teller