So I have this dataset which, among other columns, has a price column. It is structured like this:
structure(list(
investor = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2),
asset = c("x", "x", "x", "x", "x", "x", "y", "y", "Y", "y", "y"),
price = c(10, 12, 0, 0, 0, 14, 23, 25, 0, 0, 20)),
class = "data.frame", row.names = c(NA, -11L)
)
I want R to replace those zeros in the price column with the average of the previous and next available values.
For example, the third entry should be the average of 12 and 14, hence 13.
Entry number four should then become the average of 13 and 14, and so on.
So basically I want to substitute those zeros using this type of computation.
I also think that a moving average function would be fine for those zeros.
Please note that I have 6k different types of asset, hence I need to group_by(asset).
I'm a basic R user, so I don't really have any idea how to do this, but I would prefer a dplyr solution, since that is what I understand best at the moment.
CodePudding user response:
This doesn't do exactly what you say, but potentially what you want. It does a linear interpolation between the numbers on either side of the 0s, so the third number becomes 12.5, as it is filling 3 empty spots between 12 and 14. I suspect that is what you want, but there are options within zoo for different methods of filling the NAs. In your Y asset group there is no data to use for filling, so I have left it as NA.
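If the capital "Y" is just a typo rather than a separate asset, one option is to normalise the labels before grouping. A minimal base R sketch, assuming the labels differ only by case (asset_clean is a name made up for illustration):

```r
# Asset labels as given in the question, including the stray capital "Y"
asset <- c("x", "x", "x", "x", "x", "x", "y", "y", "Y", "y", "y")

# As written, "Y" forms its own one-row group
table(asset)

# Assumption: "Y" and "y" refer to the same asset, so fold everything to lower case
asset_clean <- tolower(asset)
table(asset_clean)
```

In the dplyr pipeline this would be a mutate(asset = tolower(asset)) step ahead of group_by(asset).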
library(zoo)
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
test <- data.frame(
investor = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2),
asset = c("x", "x", "x", "x", "x", "x", "y", "y", "Y", "y", "y"),
price = c(10, 12, 0, 0, 0 , 14, 23, 25, 0, 0, 20)
)
# set 0s to NA, as they are easier to fill
test <- test %>% mutate(price = ifelse(price == 0, NA, price))
# use zoo::na.approx to fill; na.rm = FALSE keeps unfillable NAs so the length is unchanged
test <- test %>% group_by(asset) %>%
  mutate(price = zoo::na.approx(price, na.rm = FALSE))
test
#> # A tibble: 11 x 3
#> # Groups: asset [3]
#> investor asset price
#> <dbl> <chr> <dbl>
#> 1 1 x 10
#> 2 1 x 12
#> 3 1 x 12.5
#> 4 2 x 13
#> 5 2 x 13.5
#> 6 2 x 14
#> 7 1 y 23
#> 8 1 y 25
#> 9 1 Y NA
#> 10 1 y 22.5
#> 11 2 y 20
Created on 2022-05-24 by the reprex package (v2.0.1)
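To see where 12.5 comes from, the interpolation for asset x can be reproduced by hand with base R's approx(). This is only a sketch of the arithmetic, not a description of how zoo works internally; idx, known and filled are illustrative names:

```r
# Prices for asset "x", with the zeros already recoded as NA
price <- c(10, 12, NA, NA, NA, 14)

idx <- seq_along(price)     # row positions 1..6
known <- !is.na(price)      # TRUE where a real price exists

# Draw straight lines through the known points and read them off at every position
filled <- approx(x = idx[known], y = price[known], xout = idx)$y
filled
```

The three gaps split the distance from 12 to 14 into four equal steps of 0.5, which gives 12.5, 13 and 13.5.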
CodePudding user response:
I don't think there's a canned function that will do exactly what you want. However, I wrote a function that would do that:
library(tidyverse)
dat <- data.frame(
investor = c(1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2),
asset = c("x", "x", "x", "x", "x", "x", "y", "y", "y", "y", "y"),
price = c(10, 12, 0, 0, 0 , 14, 23, 25, 0, 0, 20))
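closest_nonmiss <- function(x) {
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      # positions that currently hold a value (includes values filled earlier in the loop)
      w <- which(!is.na(x))
      before <- w[w < i]
      after <- w[w > i]
      # fill only when a value exists on both sides; otherwise leave the NA
      if (length(before) > 0 && length(after) > 0) {
        x[i] <- mean(x[c(max(before), min(after))])
      }
    }
  }
  x
}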
Here is how it would work:
dat %>%
  group_by(asset) %>%
  mutate(price = case_when(price == 0 ~ NA_real_,
                           TRUE ~ price)) %>%
  mutate(price = closest_nonmiss(price))
#> # A tibble: 11 × 3
#> # Groups: asset [2]
#> investor asset price
#> <dbl> <chr> <dbl>
#> 1 1 x 10
#> 2 1 x 12
#> 3 1 x 13
#> 4 2 x 13.5
#> 5 2 x 13.8
#> 6 2 x 14
#> 7 1 y 23
#> 8 1 y 25
#> 9 1 y 22.5
#> 10 1 y 21.2
#> 11 2 y 20
Created on 2022-05-23 by the reprex package (v2.0.1)
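The tibble printout rounds to three significant digits, so it can help to run the same logic on plain vectors and look at the unrounded values. A small sketch, with the loop repeated under the made-up name fill_sequential so that it runs on its own; it assumes every NA has a known value somewhere on both sides:

```r
# Same idea as closest_nonmiss() above: each NA becomes the mean of the nearest
# non-missing value before it and the nearest one after it, working left to right
fill_sequential <- function(x) {
  for (i in seq_along(x)) {
    if (is.na(x[i])) {
      known <- which(!is.na(x))        # recomputed so earlier fills count as known
      prev <- max(known[known < i])
      nxt <- min(known[known > i])
      x[i] <- mean(x[c(prev, nxt)])
    }
  }
  x
}

x_filled <- fill_sequential(c(10, 12, NA, NA, NA, 14))   # asset x
y_filled <- fill_sequential(c(23, 25, NA, NA, 20))       # asset y
x_filled
y_filled
```

The fifth value for asset x is (13.5 + 14) / 2 = 13.75 and the fourth value for asset y is (22.5 + 20) / 2 = 21.25, which the tibble shows as 13.8 and 21.2.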