Home > Software design >  How to replace non-sampled value with previous instance value in time series data file using bash sc
How to replace non-sampled value with previous instance value in time series data file using bash sc

Time:04-05

I have time series data wherein measurement values from different sensors have been captured asynchronously in same ascii file. The values are white space separated.

Original file looks like below.

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:46:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:47:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:48:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:49:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:50:30  10.19      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:51:30  NOTSAMPLED 460  NOTSAMPLED
2022-04-03 21:52:30  NOTSAMPLED NOTSAMPLED      OPEN
2022-04-03 21:53:30  10.19      NOTSAMPLED      NOTSAMPLED

Now barring the non availability of the measurements at a particular time when the other measurement value is available,I need to replace the string "NOTSAMPLED" with the previous instance value of the other sensor like below.

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  10.20      460     NOTSAMPLED
2022-04-03 21:46:30  10.20      460     CLOSE
2022-04-03 21:47:30  10.20      460     CLOSE
2022-04-03 21:48:30  10.20      460     CLOSE
2022-04-03 21:49:30  10.20      460     CLOSE
2022-04-03 21:50:30  10.19      460     CLOSE
2022-04-03 21:51:30  10.19      460     CLOSE
2022-04-03 21:52:30  10.19      460     OPEN
2022-04-03 21:53:30  10.19      460     OPEN

Can it be achieved using sed/awk or any other bash shell scripting commands ?

CodePudding user response:

Here's an awk solution that fills down all the NONSAMPLED fields (starting from the field #3):

awk '
    NR == 1 { split($0, filldown) }
    {
        for (i = 3; i <= NF; i  )
            if ($i != "NOTSAMPLED")
                filldown[i] = $i
            else
                $i = filldown[i]
        print
    }
' file.txt
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

Here's the bash version implementing the same logic. Text processing with bash is slow and isn't considered a good practice, but you might understand it better if you're not familiar with awk:

#!/bin/bash
{
    read -ra filldown

    while read -ra fields
    do
        for ((i = 2; i < ${#fields[@]}; i  ))
        do
            if [[ ${fields[i]} != NOTSAMPLED ]]
            then
                filldown[i]=${fields[i]}
            else
                fields[i]=${filldown[i]}
            fi
        done
        printf '%s\n' "${fields[*]}"
    done
} < file.txt

CodePudding user response:

First of.. this is not an efficient answer, but it gets the job done, and shows what is going on.

The file text.txt contains the example input

#!/bin/bash
#set -x

# first set the variables for the first run
oldfield1="NOTSAMPLED"
oldfield2="NOTSAMPLED"
oldfield3="NOTSAMPLED"
oldfield4="NOTSAMPLED"
oldfield5="NOTSAMPLED"
NOTSAMPLED="NOTSAMPLED"

while read line; do

        field1=$(echo ${line}| cut -d ' ' -f 1)
        field2=$(echo ${line}| cut -d ' ' -f 2)
        field3=$(echo ${line}| cut -d ' ' -f 3)
        field4=$(echo ${line}| cut -d ' ' -f 4)
        field5=$(echo ${line}| cut -d ' ' -f 5)

        [[ ${field1} == ${NOTSAMPLED} ]] && field1=${oldfield1}
        [[ ${field2} == ${NOTSAMPLED} ]] && field2=${oldfield2}
        [[ ${field3} == ${NOTSAMPLED} ]] && field3=${oldfield3}
        [[ ${field4} == ${NOTSAMPLED} ]] && field4=${oldfield4}
        [[ ${field5} == ${NOTSAMPLED} ]] && field5=${oldfield5}

        echo "${field1} ${field2} ${field3} ${field4} ${field5}"

        oldfield1="${field1}"
        oldfield2="${field2}"
        oldfield3="${field3}"
        oldfield4="${field4}"
        oldfield5="${field5}"
done <test.txt

Output:

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

CodePudding user response:

UPDATE 1 :

  benchmarking results using a `2.66GB, 60.4mn row` synthetic version of sample data


 in0: 2.66GiB 0:00:36 [75.0MiB/s] [75.0MiB/s] [===..====>] 100%            
out9: 2.01GiB 0:00:36 [56.5MiB/s] [56.5MiB/s] [ <=> ]

 
  60,406,830 lines 2054.705 MB (2154514147) /dev/stdin

 % pvE0 < sample3.txt | mawk2 '

          BEGIN { ____["NOTSAMPLED"]
                OFS=sprintf("%c",(___=  _ (  _))^--___)
          } {
              if (NR<___) {
                        NF = split($!_,__)
          } else {    _=NF
                  do { __[_]=$_=($_ in ____)? \
                       __[_]:$_ } while(___<--_) } }_' | pvE9 | wc4

Input throughput rate of 75.0 MB/s, or ~1.66 mn rows/sec

=================================================

    < sample2.txt gawk -e '

      BEGIN { ____["NOTSAMPLED"]
              OFS=sprintf("%c",(___=  _ (  _))^--___)
          } {
              if (NR<___) {
                        NF = split($!_,__)
          } else {    _=NF
                  do { __[_]=$_=($_ in ____) ? \
                       __[_]:$_ } while(___<--_) } }_'

2022-04-03  21:42:30    10.20   NOTSAMPLED  NOTSAMPLED
2022-04-03  21:45:30    10.20   460     NOTSAMPLED
2022-04-03  21:46:30    10.20   460     CLOSE
2022-04-03  21:47:30    10.20   460     CLOSE
2022-04-03  21:48:30    10.20   460     CLOSE
2022-04-03  21:49:30    10.20   460     CLOSE
2022-04-03  21:50:30    10.19   460     CLOSE
2022-04-03  21:51:30    10.19   460     CLOSE
2022-04-03  21:52:30    10.19   460     OPEN
2022-04-03  21:53:30    10.19   460     OPEN

Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk

-- The 4Chan Teller

  • Related