R: Calculate variables of every n rows while keeping the date/timestamp

Time:05-17

I have a large data frame (34,707,060 obs) of accelerometer data for the x, y and z axes. Data was collected at 30 Hz, meaning I have 30 rows of data for each second. See the head of my data below.

                Timestamp Accelerometer.X Accelerometer.Y Accelerometer.Z
1 30/06/2021 08:00:00.000          -1.109          -1.559           1.508
2 30/06/2021 08:00:00.034          -0.688          -1.043           0.891
3 30/06/2021 08:00:00.067          -0.363          -0.531           0.555
4 30/06/2021 08:00:00.100          -0.164          -0.496           0.816
5 30/06/2021 08:00:00.134           0.063          -0.363           0.496
6 30/06/2021 08:00:00.167          -0.098          -0.992           0.227

I would like to compress this dataset to one row per second, by calculating the mean, minimum, maximum, sum and standard deviation of every 30 rows. I would like to keep the Timestamp with date and time.

I have tried applying the following code to my data frame, which I copied from det's answer to the question here:

df %>% group_by(group=row_number() %/% 30) %>%
  dplyr::summarize(
    Timestamp = first(Timestamp),
    X_mean=mean(Accelerometer.X),
    Y_mean=mean(Accelerometer.Y),
    Z_mean=mean(Accelerometer.Z),
    
    X_min=min(Accelerometer.X),
    Y_min=min(Accelerometer.Y),
    Z_min=min(Accelerometer.Z),
    
    X_max=max(Accelerometer.X),
    Y_max=max(Accelerometer.Y),
    Z_max=max(Accelerometer.Z),
    
    X_sum=sum(Accelerometer.X),
    Y_sum=sum(Accelerometer.Y),
    Z_sum=sum(Accelerometer.Z),    
    
    X_sd=sd(Accelerometer.X),
    Y_sd=sd(Accelerometer.Y),
    Z_sd=sd(Accelerometer.Z)
  )

Unfortunately, this does not give me the result I want (see below).

# A tibble: 5 × 5
  group Timestamp               X_mean Y_mean Z_mean
  <dbl> <chr>                    <dbl>  <dbl>  <dbl>
1     0 30/06/2021 08:00:00.000 -0.576 -0.989  0.431
2     1 30/06/2021 08:00:00.967 -0.240 -1.06   0.270
3     2 30/06/2021 08:00:01.967 -0.287 -0.821  0.390
4     3 30/06/2021 08:00:02.967 -0.364 -0.830  0.337
5     4 30/06/2021 08:00:03.967 -0.332 -0.961 -0.086

The way it looks to me now, it first calculates all the values for the first 30 rows, and then includes these calculated values as the first row of 30 in the next calculation. So rather than calculating the compressed values for rows 1:30, 31:60, 61:90 etc., it keeps applying the code to lines 1:30.
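Checking the grouping expression on plain row numbers (a quick sketch, independent of my data) shows that the blocks it produces are not what I expected:

```r
# How row_number() %/% 30 splits 90 rows into groups:
table((1:90) %/% 30)
#  0  1  2  3
# 29 30 30  1
# -> the first block has only 29 rows, and a stray 1-row block appears at the end
```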

I am not sure how to adjust the code to calculate the mean, min, max, sum, sd for every 30 rows (so 1:30, 31:60 etc.). Would really appreciate some help.

CodePudding user response:

You can use dmy_hms to convert your Timestamp column to a lubridate date-time object, and floor_date to round it down to the second. Then I'd rather use across here to compute the mean, min, max, sum and sd:

library(lubridate)
library(dplyr)

df %>% 
  group_by(sec = floor_date(dmy_hms(Timestamp), "second")) %>% 
  summarise(Timestamp = first(Timestamp),
            across(-Timestamp, 
                   list(mean = mean, min = min, max = max, sum = sum, sd = sd),
                   .names = "{.col}_{.fn}"))
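If you'd rather keep the row-based grouping from your original attempt (this assumes the recording really does contain exactly 30 rows per second, with no dropped samples), the same across call works once the grouping index is shifted by one. A sketch on synthetic data, since the real df isn't reproducible here:

```r
library(dplyr)

# Synthetic stand-in: 90 rows = 3 seconds at 30 Hz, column names as in the question
set.seed(1)
df <- data.frame(
  Timestamp       = rep(sprintf("30/06/2021 08:00:0%d.000", 0:2), each = 30),
  Accelerometer.X = rnorm(90),
  Accelerometer.Y = rnorm(90),
  Accelerometer.Z = rnorm(90)
)

out <- df %>%
  group_by(group = (row_number() - 1) %/% 30) %>%   # rows 1:30 -> 0, 31:60 -> 1, ...
  summarise(Timestamp = first(Timestamp),
            across(starts_with("Accelerometer"),
                   list(mean = mean, min = min, max = max, sum = sum, sd = sd),
                   .names = "{.col}_{.fn}"),
            .groups = "drop")

nrow(out)  # 3: one summary row per block of 30
```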
Tags: r