I need to add a new column containing an specifica function to a data frame.
Basically i need to calculate an indicator which is the sum of the past 5 observations (in column "value1") multuplied by 100 and divided by column "value2" {this one not as a sum, just the simple observatio} of my sample data below.
somewhat like this (its not a formal notation): indicator = [sum (i-5) value1 / value2] * 100
the indicator must be calculate by country.
in case of countries or dates "mixed" in the data frame the formula need to be able to recognize and sum the correct values only, in the correct order.
If there is a NA value in the value 1, the formula should also be able to ignore this line as a computation. ex: 31/12, 1/01, 2/01, 3/01, 4/01 = NA, 05/01 --> the indicator of 06/01 will then take into account the past 5 valid observation, 31/12, 1/01, 2/01, 3/01, 05/01.
Important -> only use base R
Example of the data frame (my actual data frame is more complex)
set.seed(1)
Country <- c(rep("USA", 10),rep("UK", 10), rep("China", 10))
Value1 <- sample(x = c(120, 340, 423), size = 30, replace = TRUE)
Value2 <- sample(x = c(1,3,5,6,9), size = 30, replace = TRUE)
date <- seq(as.POSIXct('2020/01/01'),
as.POSIXct('2020/01/30'),
by = "1 day")
df = data.frame(Country, Value1, Value2, date)
I thank you all very much in advance. this one has bein very hard to crack :D
CodePudding user response:
Since it has to be done group-wise but in base R, you could use the split-apply-bind method
df2 <- do.call(rbind, lapply(split(df, df$Country), function(d) {
d <- d[order(d$date),]
d$computed <- 100 * d$Value1 / d$Value2
d$Result <- NA
for(i in 5:nrow(d)) d$Result[i] <- sum(tail(na.omit(d$computed[seq(i)]), 5))
d[!names(d) %in% "computed"]
}))
rn <- sapply(strsplit(rownames(df2), "\\."), function(x) as.numeric(x[2]))
`rownames<-`(df2[rn,], NULL)
#> Country Value1 Value2 date Result
#> 1 USA 423 9 2020-01-01 NA
#> 2 USA 120 3 2020-01-02 NA
#> 3 USA 120 3 2020-01-03 NA
#> 4 USA 423 5 2020-01-04 NA
#> 5 USA 120 1 2020-01-05 33160.00
#> 6 USA 120 1 2020-01-06 40460.00
#> 7 USA 120 3 2020-01-07 40460.00
#> 8 USA 340 1 2020-01-08 70460.00
#> 9 USA 423 6 2020-01-09 69050.00
#> 10 USA 340 9 2020-01-10 60827.78
#> 11 UK 340 5 2020-01-11 NA
#> 12 UK 423 6 2020-01-12 NA
#> 13 UK 423 3 2020-01-13 NA
#> 14 UK 340 1 2020-01-14 NA
#> 15 UK 120 3 2020-01-15 65950.00
#> 16 UK 120 9 2020-01-16 60483.33
#> 17 UK 423 1 2020-01-17 95733.33
#> 18 UK 423 9 2020-01-18 86333.33
#> 19 UK 340 1 2020-01-19 86333.33
#> 20 UK 340 3 2020-01-20 93666.67
#> 21 China 340 1 2020-01-21 NA
#> 22 China 340 9 2020-01-22 NA
#> 23 China 423 3 2020-01-23 NA
#> 24 China 120 1 2020-01-24 NA
#> 25 China 340 9 2020-01-25 67655.56
#> 26 China 340 5 2020-01-26 40455.56
#> 27 China 120 5 2020-01-27 39077.78
#> 28 China 340 9 2020-01-28 28755.56
#> 29 China 340 9 2020-01-29 20533.33
#> 30 China 423 5 2020-01-30 25215.56
Created on 2022-06-08 by the reprex package (v2.0.1)
CodePudding user response:
Here's an option - not sure if the calculation is as you intend:
split_df <- split(df, Country)
split_df <- lapply(split_df, function(x) {
x <- x[order(x$date),]
x$index <- nrow(x):1
x$indicator <- ifelse(x$index <= 5, sum(x$Value2[x$index <= 5]) * 100 / x$Value2, NA)
x$index <- NULL
return(x)
})
final_df <- do.call(rbind, split_df)
Country Value1 Value2 date indicator
China.21 China 120 3 2020-01-21 NA
China.22 China 423 5 2020-01-22 NA
China.23 China 340 6 2020-01-23 NA
China.24 China 120 3 2020-01-24 NA
China.25 China 340 9 2020-01-25 NA
China.26 China 423 6 2020-01-26 366.6667
China.27 China 120 3 2020-01-27 733.3333
China.28 China 340 3 2020-01-28 733.3333
China.29 China 120 5 2020-01-29 440.0000
China.30 China 340 5 2020-01-30 440.0000
UK.11 UK 423 1 2020-01-11 NA
UK.12 UK 340 6 2020-01-12 NA
UK.13 UK 423 1 2020-01-13 NA
UK.14 UK 423 5 2020-01-14 NA
UK.15 UK 340 6 2020-01-15 NA
UK.16 UK 340 1 2020-01-16 2400.0000
UK.17 UK 120 5 2020-01-17 480.0000
UK.18 UK 423 9 2020-01-18 266.6667
UK.19 UK 120 6 2020-01-19 400.0000
UK.20 UK 423 3 2020-01-20 800.0000
USA.1 USA 423 1 2020-01-01 NA
USA.2 USA 423 5 2020-01-02 NA
USA.3 USA 423 5 2020-01-03 NA
USA.4 USA 423 6 2020-01-04 NA
USA.5 USA 423 1 2020-01-05 NA
USA.6 USA 340 5 2020-01-06 600.0000
USA.7 USA 340 5 2020-01-07 600.0000
USA.8 USA 423 6 2020-01-08 500.0000
USA.9 USA 423 5 2020-01-09 600.0000
USA.10 USA 423 9 2020-01-10 333.3333
CodePudding user response:
In base R you could do:
transform(df,Results=ave(Value1,Country,FUN=function(x)replace(x,!is.na(x),
filter(na.omit(x),rep(1,5),sides=1)))/Value2)
Country Value1 Value2 date Results
1 USA 120 1 2020-01-01 NA
2 USA 423 6 2020-01-02 NA
3 USA 120 1 2020-01-03 NA
4 USA 340 6 2020-01-04 NA
5 USA 120 5 2020-01-05 224.6000
6 USA 423 3 2020-01-06 475.3333
7 USA 423 3 2020-01-07 475.3333
8 USA 340 6 2020-01-08 274.3333
9 USA 340 6 2020-01-09 274.3333
10 USA 423 6 2020-01-10 324.8333
11 UK 423 3 2020-01-11 NA
12 UK 120 6 2020-01-12 NA
13 UK 120 1 2020-01-13 NA
14 UK 120 1 2020-01-14 NA
15 UK 340 6 2020-01-15 187.1667
16 UK 340 1 2020-01-16 1040.0000
17 UK 340 3 2020-01-17 420.0000
18 UK 340 5 2020-01-18 296.0000
19 UK 423 3 2020-01-19 594.3333
20 UK 120 3 2020-01-20 521.0000
21 China 423 9 2020-01-21 NA
22 China 120 3 2020-01-22 NA
23 China 120 1 2020-01-23 NA
24 China 120 5 2020-01-24 NA
25 China 120 5 2020-01-25 180.6000
26 China 340 6 2020-01-26 136.6667
27 China 120 5 2020-01-27 164.0000
28 China 120 1 2020-01-28 820.0000
29 China 340 6 2020-01-29 173.3333
30 China 340 9 2020-01-30 140.0000