I am facing an issue to compute the median of values in a column. I have hundreds of values in a column but I want to compute the median of five values in a sequence at a time and store the median value in a separate data frame in R and similarly continue until the end of data values available.
The issue is there are some values with #VALUE! and while computing the median if there comes such a value i.e. #VALUE! and there are less than 5 values to take median, then the program should just take the median with whatever number of values are available. Similarly, for the last median value, if there are less than 5 values available to take a median, then it should compute the median with the available number of values.
The link to the .csv file with one column of sample data could be downloaded from here.
If someone can help me with this I would be very grateful.
Thank you
CodePudding user response:
You could try this:
- Group by a sequence of 5 rows
- replace
#VALUE!
withNA
- convert to numeric
- summarise with median
Speed %>%
group_by(group5 = rep(row_number(), each=5, length.out = n())) %>%
mutate(speed = ifelse(speed== "#VALUE!", NA, speed)) %>%
type.convert(as.is = TRUE) %>%
summarise(median = median(speed, na.rm = TRUE))
group5 median
<int> <dbl>
1 1 1.32
2 2 -4.97
3 3 -13.1
4 4 -14.3
5 5 6.89
6 6 -2.97
7 7 -11.6
8 8 -16.0
9 9 -18.6
10 10 -19.9
# ... with 72 more rows
CodePudding user response:
Update for "tumbling window" (top, here) versus "rolling window" (below, kept for posterity/reference). Still using the dat
from the top of the rolling-window discussion.
The discussion about #VALUE!
(far below) may still be relevant, I'll include the code here.
base R
dat$speed <- suppressWarnings(as.numeric(dat$speed))
aggregate(dat$speed, list(grp = (seq_len(nrow(dat)) - 1) %/% 5), FUN = median, na.rm = TRUE)
# grp x
# 1 0 3.4245
# 2 1 -4.9730
dplyr
library(dplyr)
dat %>%
mutate(speed = suppressWarnings(as.numeric(speed))) %>%
group_by(grp = (seq_len(n()) - 1) %/% 5) %>%
summarize(med5 = median(speed, na.rm = TRUE))
# # A tibble: 2 x 2
# grp med5
# <dbl> <dbl>
# 1 0 3.42
# 2 1 -4.97
data.table
library(data.table)
as.data.table(dat)[, speed := suppressWarnings(as.numeric(speed))
][, .(med5 = median(speed, na.rm = TRUE)), by = .(grp = (seq_len(nrow(dat)) - 1) %/% 5)][]
# grp med5
# <num> <num>
# 1: 0 3.4245
# 2: 1 -4.9730
(This below is for rolling-windows, no longer relevant other than the definition of dat
data.)
I copied the first 10 rows from that single-column frame, and get
dat <- structure(list(speed = c(0, 5.534, 1.315, 7.6865, -0.479, -0.4605, -4.311, -4.973, -7.69, -11.669)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame")
With this, the basic operation is something like this:
newvec <- zoo::rollmedian(dat$speed, 5)
newvec
# [1] 1.3150 1.3150 -0.4605 -0.4790 -4.3110 -4.9730
Note that this creates nrow(dat) - (k - 1)
(20) values, where k=5
(your window). In general, rolling operations tend to be reductive, but we have options to change that if needed. For instance, we can keep it the same length and pad the ends with NA
; to do this, we can "align" the window left, center, or right:
zoo::rollmedian(dat$speed, 5, fill = NA, align = "left")
# [1] 1.3150 1.3150 -0.4605 -0.4790 -4.3110 -4.9730 NA NA NA NA
zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
# [1] NA NA 1.3150 1.3150 -0.4605 -0.4790 -4.3110 -4.9730 NA NA
zoo::rollmedian(dat$speed, 5, fill = NA, align = "right")
# [1] NA NA NA NA 1.3150 1.3150 -0.4605 -0.4790 -4.3110 -4.9730
Let me show how alignment fits here. With align="center"
(the default), we see:
dat$speed
# [1] 0.0000 5.5340 1.3150 7.6865 -0.4790 -0.4605 -4.3110 -4.9730 -7.6900 -11.6690
### `----------------------------------------'
### take the median of these values,
### and then assign that single value here:
### /
### ,---------------'
### /
# [1] NA NA 1.3150 1.3150 -0.4605 -0.4790 -4.3110 -4.9730 NA NA
dat$speed[1:5]
# [1] 0.0000 5.5340 1.3150 7.6865 -0.4790
median(dat$speed[1:5])
# [1] 1.315
For the next value,
# [1] 0.0000 5.5340 1.3150 7.6865 -0.4790 -0.4605 -4.3110 -4.9730 -7.6900 -11.6690
### `----------------------------------------'
### take the median of these values,
### and then assign that single value here:
### /
### ,---------------'
### /
# [1] NA NA 1.3150 1.3150 -0.4605 -0.4790 -4.3110 -4.9730 NA NA
dat$speed[2:6]
# [1] 5.5340 1.3150 7.6865 -0.4790 -0.4605
median(dat$speed[2:6])
# [1] 1.315
So we can assign that easily to a new frame with data.frame(rollmed = newvec)
, with or without padding. If you want to assign it back to the original frame, one can do:
dat$rollmed <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
dat
# speed rollmed
# 1 0.0000 NA
# 2 5.5340 NA
# 3 1.3150 1.3150
# 4 7.6865 1.3150
# 5 -0.4790 -0.4605
# 6 -0.4605 -0.4790
# 7 -4.3110 -4.3110
# 8 -4.9730 -4.9730
# 9 -7.6900 NA
# 10 -11.6690 NA
As for your #VALUE!
, that likely manifests as a character
column instead of numeric
, in which case you have an extra step before all of the above.
I'll change one of my values to that error:
dat$speed[5] <- "#VALUE!"
dat
# speed
# 1 0
# 2 5.534
# 3 1.315
# 4 7.6865
# 5 #VALUE!
# 6 -0.4605
# 7 -4.311
# 8 -4.973
# 9 -7.69
# 10 -11.669
str(dat)
# 'data.frame': 10 obs. of 1 variable:
# $ speed: chr "0" "5.534" "1.315" "7.6865" ...
(See? chr
.)
From here, we can simply convert all to numbers, ignoring the warnings we get:
dat$speed <- suppressWarnings(as.numeric(dat$speed))
dat
# speed
# 1 0.0000
# 2 5.5340
# 3 1.3150
# 4 7.6865
# 5 NA
# 6 -0.4605
# 7 -4.3110
# 8 -4.9730
# 9 -7.6900
# 10 -11.6690
str(dat)
# 'data.frame': 10 obs. of 1 variable:
# $ speed: num 0 5.53 1.31 7.69 NA ...
And from here, we can do the rolling median again. Note that the NA
we now have changes things slightly:
dat$rollmed <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center")
dat$rollmed2 <- zoo::rollmedian(dat$speed, 5, fill = NA, align = "center", na.rm = TRUE)
dat
# speed rollmed rollmed2
# 1 0.0000 NA NA
# 2 5.5340 NA NA
# 3 1.3150 NA 3.42450
# 4 7.6865 NA 3.42450
# 5 NA NA 0.42725
# 6 -0.4605 NA -2.38575
# 7 -4.3110 NA -4.64200
# 8 -4.9730 -4.973 -4.97300
# 9 -7.6900 NA NA
# 10 -11.6690 NA NA
The default (what we did before) will return an NA
median for every row within /- 4 rows (k-1
) of the previous #VALUE!
; we can add na.rm=TRUE
if you would prefer; that is not a rolling-window thing, that's a general-statistics question, "are empty values an issue".