How to replace non-sampled value with previous instance value in time series data file using bash sc-CodePudding

I have time series data wherein measurement values from different sensors have been captured asynchronously in same ascii file. The values are white space separated.

Original file looks like below.

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:46:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:47:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:48:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:49:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:50:30  10.19      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:51:30  NOTSAMPLED 460  NOTSAMPLED
2022-04-03 21:52:30  NOTSAMPLED NOTSAMPLED      OPEN
2022-04-03 21:53:30  10.19      NOTSAMPLED      NOTSAMPLED

Now barring the non availability of the measurements at a particular time when the other measurement value is available,I need to replace the string "NOTSAMPLED" with the previous instance value of the other sensor like below.

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  10.20      460     NOTSAMPLED
2022-04-03 21:46:30  10.20      460     CLOSE
2022-04-03 21:47:30  10.20      460     CLOSE
2022-04-03 21:48:30  10.20      460     CLOSE
2022-04-03 21:49:30  10.20      460     CLOSE
2022-04-03 21:50:30  10.19      460     CLOSE
2022-04-03 21:51:30  10.19      460     CLOSE
2022-04-03 21:52:30  10.19      460     OPEN
2022-04-03 21:53:30  10.19      460     OPEN

Can it be achieved using sed/awk or any other bash shell scripting commands ?

CodePudding user response：

Here's an awk solution that fills down all the NONSAMPLED fields (starting from the field #3):

awk '
    NR == 1 { split($0, filldown) }
    {
        for (i = 3; i <= NF; i  )
            if ($i != "NOTSAMPLED")
                filldown[i] = $i
            else
                $i = filldown[i]
        print
    }
' file.txt

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

Here's the bash version implementing the same logic. Text processing with bash is slow and isn't considered a good practice, but you might understand it better if you're not familiar with awk:

#!/bin/bash
{
    read -ra filldown

    while read -ra fields
    do
        for ((i = 2; i < ${#fields[@]}; i  ))
        do
            if [[ ${fields[i]} != NOTSAMPLED ]]
            then
                filldown[i]=${fields[i]}
            else
                fields[i]=${filldown[i]}
            fi
        done
        printf '%s\n' "${fields[*]}"
    done
} < file.txt

CodePudding user response：

First of.. this is not an efficient answer, but it gets the job done, and shows what is going on.

The file text.txt contains the example input

#!/bin/bash
#set -x

# first set the variables for the first run
oldfield1="NOTSAMPLED"
oldfield2="NOTSAMPLED"
oldfield3="NOTSAMPLED"
oldfield4="NOTSAMPLED"
oldfield5="NOTSAMPLED"
NOTSAMPLED="NOTSAMPLED"

while read line; do

        field1=$(echo ${line}| cut -d ' ' -f 1)
        field2=$(echo ${line}| cut -d ' ' -f 2)
        field3=$(echo ${line}| cut -d ' ' -f 3)
        field4=$(echo ${line}| cut -d ' ' -f 4)
        field5=$(echo ${line}| cut -d ' ' -f 5)

        [[ ${field1} == ${NOTSAMPLED} ]] && field1=${oldfield1}
        [[ ${field2} == ${NOTSAMPLED} ]] && field2=${oldfield2}
        [[ ${field3} == ${NOTSAMPLED} ]] && field3=${oldfield3}
        [[ ${field4} == ${NOTSAMPLED} ]] && field4=${oldfield4}
        [[ ${field5} == ${NOTSAMPLED} ]] && field5=${oldfield5}

        echo "${field1} ${field2} ${field3} ${field4} ${field5}"

        oldfield1="${field1}"
        oldfield2="${field2}"
        oldfield3="${field3}"
        oldfield4="${field4}"
        oldfield5="${field5}"
done <test.txt

Output:

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

CodePudding user response：

UPDATE 1 :

  benchmarking results using a `2.66GB, 60.4mn row` synthetic version of sample data


 in0: 2.66GiB 0:00:36 [75.0MiB/s] [75.0MiB/s] [===..====>] 100%            
out9: 2.01GiB 0:00:36 [56.5MiB/s] [56.5MiB/s] [ <=> ]

 
  60,406,830 lines 2054.705 MB (2154514147) /dev/stdin

 % pvE0 < sample3.txt | mawk2 '

          BEGIN { ____["NOTSAMPLED"]
                OFS=sprintf("%c",(___=  _ (  _))^--___)
          } {
              if (NR<___) {
                        NF = split($!_,__)
          } else {    _=NF
                  do { __[_]=$_=($_ in ____)? \
                       __[_]:$_ } while(___<--_) } }_' | pvE9 | wc4

Input throughput rate of 75.0 MB/s, or ~1.66 mn rows/sec

=================================================

    < sample2.txt gawk -e '

      BEGIN { ____["NOTSAMPLED"]
              OFS=sprintf("%c",(___=  _ (  _))^--___)
          } {
              if (NR<___) {
                        NF = split($!_,__)
          } else {    _=NF
                  do { __[_]=$_=($_ in ____) ? \
                       __[_]:$_ } while(___<--_) } }_'

2022-04-03  21:42:30    10.20   NOTSAMPLED  NOTSAMPLED
2022-04-03  21:45:30    10.20   460     NOTSAMPLED
2022-04-03  21:46:30    10.20   460     CLOSE
2022-04-03  21:47:30    10.20   460     CLOSE
2022-04-03  21:48:30    10.20   460     CLOSE
2022-04-03  21:49:30    10.20   460     CLOSE
2022-04-03  21:50:30    10.19   460     CLOSE
2022-04-03  21:51:30    10.19   460     CLOSE
2022-04-03  21:52:30    10.19   460     OPEN
2022-04-03  21:53:30    10.19   460     OPEN

Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk

-- The 4Chan Teller