I have a df with 6 columns. I want to find the delta based on the date and also group by first few columns.
df <- data.frame (col1 = c('A1','A1','A1','A2','A2','A2','A1','A1','A1','A2','A2','A2','A2','A2','A2'),
col2 = c('B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3'),
col3 = c('C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3'),
col4 = c('D1','D2','D22','D4','D5','D6','D1','D2','D3','D4','D5','D6','D7','D8','D9'),
col5 = c('1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021',
'1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022',
'1/01/2022','1/01/2022','1/01/2022'),
col6 = c(10,20,30,40,50,60,100, 200, 300,400,500,600,60,60, 60)
)
diff_na<-df%>%
group_by(col1,col2,col3,col4) %>%
mutate(diff = col6 - lag(col6, default = first(col6,default = 0), order_by = col5))
Expected output is :
df11 <- data.frame (col1 = c('A1','A1','A1','A2','A2','A2','A1','A1','A1','A2','A2','A2','A2','A2','A2'),
col2 = c('B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3','B1','B2','B3'),
col3 = c('C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3','C1','C2','C3'),
col4 = c('D1','D2','D22','D4','D5','D6','D1','D2','D3','D4','D5','D6','D7','D8','D9'),
col5 = c('1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021','1/01/2021',
'1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022','1/01/2022',
'1/01/2022','1/01/2022','1/01/2022'),
col6 = c(10,20,30,40,50,60,100, 200, 300,400,500,600,60,60, 60),
dfiff =c(0,0,30,0,0,0,90,180,300,360,450,540,60,60,60)
)
I am facing an issue if the value in the previous col4 is not there , then it does not subtract the value. I mean it should treat the missing value as 0. I tried giving first default as 0. But somehow the last three diff values are 0 instead of 60. Please guide where I am going wrong.
Row 3 has value in Col4 as D22 which is not there for date 01/01/2022 so 30 should be there. similary row 13,14,15 does not have corresponding value for date 01/01/2021. So diff col should have 60 .
Thanks & Regards, R
CodePudding user response:
From the extra details in the comments section, it seems that we need to use some conditional logic because of the special status of the initial date in the data frame. One way to handle this is to use group_split
and map
library(tidyverse)
df %>%
mutate(rn = row_number()) %>%
group_split(col1, col2, col3, col4) %>%
map(~ if(nrow(.x) > 1) {
mutate(.x, diff = c(0, diff(col6)))
} else {
mutate(.x, diff = if(col5 == '1/01/2021') 0 else col6)
}) %>%
bind_rows() %>%
arrange(rn) %>%
select(-rn)
#> # A tibble: 15 x 7
#> col1 col2 col3 col4 col5 col6 diff
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 A1 B1 C1 D1 1/01/2021 10 0
#> 2 A1 B2 C2 D2 1/01/2021 20 0
#> 3 A1 B3 C3 D22 1/01/2021 30 0
#> 4 A2 B1 C1 D4 1/01/2021 40 0
#> 5 A2 B2 C2 D5 1/01/2021 50 0
#> 6 A2 B3 C3 D6 1/01/2021 60 0
#> 7 A1 B1 C1 D1 1/01/2022 100 90
#> 8 A1 B2 C2 D2 1/01/2022 200 180
#> 9 A1 B3 C3 D3 1/01/2022 300 300
#> 10 A2 B1 C1 D4 1/01/2022 400 360
#> 11 A2 B2 C2 D5 1/01/2022 500 450
#> 12 A2 B3 C3 D6 1/01/2022 600 540
#> 13 A2 B1 C1 D7 1/01/2022 60 60
#> 14 A2 B2 C2 D8 1/01/2022 60 60
#> 15 A2 B3 C3 D9 1/01/2022 60 60
Created on 2022-11-21 with reprex v2.0.2
CodePudding user response:
Here's a solution without group_map
. I hope this does what you want, at least it reproduces the output for your example data.
I assume you want to group only by the first two letters of col4
so I split the column into col4a
and col4b
grouped by col4a
and arranged by col4b
and col5
. It seems you want to use 0
as a default value in the lag only for the values in the second year and input 0
in the first year (except when you have multiple values in the first year)
Also the way your final data is ordered is not very intuitive.
df12 <- df %>%
mutate(
col4a = str_sub(col4, 1, 2),
col4b = str_sub(col4, 3, 3)
) %>%
group_by(col1, col2, col3, col4a) %>%
arrange(col5, col4b) %>%
mutate(
dfiff = col6 - lag(col6, default=NA),
dfiff = coalesce(dfiff, if_else(col5=="1/01/2021" & col4b=="", 0, col6))
) %>%
ungroup() %>%
arrange(col5, col4a, col1, col3, col2) %>%
select(-col4a, -col4b)
I hope this helps, some other remarks:
I highly recommend not using strings as dates, especially when the date format is one in which alphabetical sorting does not correspond to the order of the dates (here it works accidentally because you always use "1/01" of each year, but you might get some other data and then your code would produce wrong output), you could use strptime
with the format "%m/%d/%Y"
to parse the dates from your column (assuming the one digit number is the month and the two digit number is the day).
Please try to describe better what you wish to accomplish, just a short paragraph is enough in most cases. The example input and output helps, thanks for posting it. But it takes way less time to answer your question if one has more context and does not need to reverse engineer the whole logic from input and output and more people will answer more quickly.