Home > Mobile >  Calculating median of a column sequentially in R and storing values in dataframe
Calculating median of a column sequentially in R and storing values in dataframe

Time:12-04

I am facing an issue to compute the median of values in a column. I have hundreds of values in a column but I want to compute the median of five values in a sequence at a time and store the median value in a separate data frame in R and similarly continue until the end of data values available.

The issue is there are some values with #VALUE! and while computing the median if there comes such a value i.e. #VALUE! and there are less than 5 values to take median, then the program should just take the median with whatever number of values are available. Similarly, for the last median value, if there are less than 5 values available to take a median, then it should compute the median with the available number of values.

The link to the .csv file with one column of sample data could be downloaded from here.

LINK to FILE

If someone can help me with this I would be very grateful.

Thank you

CodePudding user response:

You could try this:

  1. Group by a sequence of 5 rows
  2. replace #VALUE! with NA
  3. convert to numeric
  4. summarise with median
Speed %>% 
  group_by(group5 = rep(row_number(), each=5, length.out = n())) %>% 
  mutate(speed = ifelse(speed== "#VALUE!", NA, speed)) %>% 
  type.convert(as.is = TRUE) %>% 
  summarise(median = median(speed, na.rm = TRUE))
  group5 median
    <int>  <dbl>
 1      1   1.32
 2      2  -4.97
 3      3 -13.1 
 4      4 -14.3 
 5      5   6.89
 6      6  -2.97
 7      7 -11.6 
 8      8 -16.0 
 9      9 -18.6 
10     10 -19.9 
# ... with 72 more rows

CodePudding user response:

Update for "tumbling window" (top, here) versus "rolling window" (below, kept for posterity/reference). Still using the dat from the top of the rolling-window discussion.

The discussion about #VALUE! (far below) may still be relevant, I'll include the code here.

base R

dat$speed <- suppressWarnings(as.numeric(dat$speed))
aggregate(dat$speed, list(grp = (seq_len(nrow(dat)) - 1) %/% 5), FUN = median, na.rm = TRUE)
#   grp       x
# 1   0  3.4245
# 2   1 -4.9730

dplyr

library(dplyr)
dat %>%
  mutate(speed = suppressWarnings(as.numeric(speed))) %>%
  group_by(grp = (seq_len(n()) - 1) %/% 5) %>%
  summarize(med5 = median(speed, na.rm = TRUE))
# # A tibble: 2 x 2
#     grp  med5
#   <dbl> <dbl>
# 1     0  3.42
# 2     1 -4.97

data.table

library(data.table)
as.data.table(dat)[, speed := suppressWarnings(as.numeric(speed))
  ][, .(med5 = median(speed, na.rm = TRUE)), by = .(grp = (seq_len(nrow(dat)) - 1) %/% 5)][]
#      grp    med5
#    <num>   <num>
# 1:     0  3.4245
# 2:     1 -4.9730

(This below is for rolling-windows, no longer relevant other than the definition of dat data.)


I copied the first 10 rows from that single-column frame, and get

dat <- structure(list(speed = c(0, 5.534, 1.315, 7.6865, -0.479, -0.4605, -4.311, -4.973, -7.69, -11.669)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")

With this, the basic operation is something like this:

newvec <- zoo::rollmedian(dat$speed, 5)
newvec
# [1]  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730

Note that this creates nrow(dat) - (k - 1) (20) values, where k=5 (your window). In general, rolling operations tend to be reductive, but we have options to change that if needed. For instance, we can keep it the same length and pad the ends with NA; to do this, we can "align" the window left, center, or right:

zoo::rollmedian(dat$speed, 5, fill = NA, align = "left")
#  [1]  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730      NA      NA      NA      NA
zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
#  [1]      NA      NA  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730      NA      NA
zoo::rollmedian(dat$speed, 5, fill = NA, align = "right")
#  [1]      NA      NA      NA      NA  1.3150  1.3150 -0.4605 -0.4790 -4.3110 -4.9730

Let me show how alignment fits here. With align="center" (the default), we see:

dat$speed
#  [1]   0.0000   5.5340   1.3150   7.6865  -0.4790  -0.4605  -4.3110  -4.9730  -7.6900 -11.6690
###      `----------------------------------------'
###          take the median of these values,
###          and then assign that single value here:
###                                              /
###                             ,---------------'
###                            / 
#  [1]       NA       NA   1.3150   1.3150  -0.4605  -0.4790  -4.3110  -4.9730       NA       NA

dat$speed[1:5]
# [1]  0.0000  5.5340  1.3150  7.6865 -0.4790
median(dat$speed[1:5])
# [1] 1.315

For the next value,

#  [1]   0.0000   5.5340   1.3150   7.6865  -0.4790  -0.4605  -4.3110  -4.9730  -7.6900 -11.6690
###               `----------------------------------------'
###                   take the median of these values,
###                   and then assign that single value here:
###                                                       /
###                                      ,---------------'
###                                     / 
#  [1]       NA       NA   1.3150   1.3150  -0.4605  -0.4790  -4.3110  -4.9730       NA       NA

dat$speed[2:6]
# [1]  5.5340  1.3150  7.6865 -0.4790 -0.4605
median(dat$speed[2:6])
# [1] 1.315

So we can assign that easily to a new frame with data.frame(rollmed = newvec), with or without padding. If you want to assign it back to the original frame, one can do:

dat$rollmed <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
dat
#       speed rollmed
# 1    0.0000      NA
# 2    5.5340      NA
# 3    1.3150  1.3150
# 4    7.6865  1.3150
# 5   -0.4790 -0.4605
# 6   -0.4605 -0.4790
# 7   -4.3110 -4.3110
# 8   -4.9730 -4.9730
# 9   -7.6900      NA
# 10 -11.6690      NA

As for your #VALUE!, that likely manifests as a character column instead of numeric, in which case you have an extra step before all of the above.

I'll change one of my values to that error:

dat$speed[5] <- "#VALUE!"
dat
#      speed
# 1        0
# 2    5.534
# 3    1.315
# 4   7.6865
# 5  #VALUE!
# 6  -0.4605
# 7   -4.311
# 8   -4.973
# 9    -7.69
# 10 -11.669
str(dat)
# 'data.frame': 10 obs. of  1 variable:
#  $ speed: chr  "0" "5.534" "1.315" "7.6865" ...

(See? chr.)

From here, we can simply convert all to numbers, ignoring the warnings we get:

dat$speed <- suppressWarnings(as.numeric(dat$speed))
dat
#       speed
# 1    0.0000
# 2    5.5340
# 3    1.3150
# 4    7.6865
# 5        NA
# 6   -0.4605
# 7   -4.3110
# 8   -4.9730
# 9   -7.6900
# 10 -11.6690
str(dat)
# 'data.frame': 10 obs. of  1 variable:
#  $ speed: num  0 5.53 1.31 7.69 NA ...

And from here, we can do the rolling median again. Note that the NA we now have changes things slightly:

dat$rollmed <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
dat$rollmed2 <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center", na.rm = TRUE)
dat
#       speed rollmed rollmed2
# 1    0.0000      NA       NA
# 2    5.5340      NA       NA
# 3    1.3150      NA  3.42450
# 4    7.6865      NA  3.42450
# 5        NA      NA  0.42725
# 6   -0.4605      NA -2.38575
# 7   -4.3110      NA -4.64200
# 8   -4.9730  -4.973 -4.97300
# 9   -7.6900      NA       NA
# 10 -11.6690      NA       NA

The default (what we did before) will return an NA median for every row within /- 4 rows (k-1) of the previous #VALUE!; we can add na.rm=TRUE if you would prefer; that is not a rolling-window thing, that's a general-statistics question, "are empty values an issue".

  • Related