I am trying to create some code to determine if a certain statement is true for three consecutive years. This statement is, for example, that the amount needs to be negative for three consecutive years. Only if that is true, the third row should be marked in a new column with a 1, otherwise, the rows should be marked as a 0.
To further elaborate on what I try to do, I've created the following dataset:
ID | Year | Amount |
---|---|---|
001 | 2022 | 2 |
001 | 2021 | -9 |
001 | 2020 | -10 |
001 | 2019 | -16 |
001 | 2018 | -20 |
002 | 2022 | 400 |
002 | 2021 | 300 |
002 | 2020 | -200 |
002 | 2019 | -600 |
002 | 2018 | -500 |
And the outcome should look somewhat like this:
ID | Year | Amount | Y/N |
---|---|---|---|
001 | 2022 | 2 | 0 |
001 | 2021 | -9 | 1 |
001 | 2020 | -10 | 1 |
001 | 2019 | -16 | 0 |
001 | 2018 | -20 | 0 |
002 | 2022 | 400 | 0 |
002 | 2021 | 300 | 0 |
002 | 2020 | -200 | 1 |
002 | 2019 | -600 | 0 |
002 | 2018 | -500 | 0 |
Does anybody know how I could code this?
CodePudding user response:
This should do it :
df <- read.table(h=T, text="ID Year Amount
001 2022 2
001 2021 -9
001 2020 -10
001 2019 -16
001 2018 -20
002 2022 400
002 2021 300
002 2020 -200
002 2019 -600
002 2018 -500")
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(YN = rev(diff(c(0,0,0, cumsum(rev(Amount)<0)), 3) == 3)) %>%
ungroup()
#> # A tibble: 10 × 4
#> ID Year Amount YN
#> <int> <int> <int> <lgl>
#> 1 1 2022 2 FALSE
#> 2 1 2021 -9 TRUE
#> 3 1 2020 -10 TRUE
#> 4 1 2019 -16 FALSE
#> 5 1 2018 -20 FALSE
#> 6 2 2022 400 FALSE
#> 7 2 2021 300 FALSE
#> 8 2 2020 -200 TRUE
#> 9 2 2019 -600 FALSE
#> 10 2 2018 -500 FALSE
Created on 2022-04-21 by the reprex package (v2.0.1)
In base R:
df$YN <- ave(df$Amount, df$ID, FUN = function(x) rev(diff(c(0,0,0, cumsum(rev(x)<0)), 3) == 3)) == 1
Here's a walkthrough :
df %>%
arrange(ID, Year) %>%
group_by(ID) %>%
mutate(
cond = Amount < 0,
cumsum = cumsum(cond),
diff = diff(c(0,0,0, cumsum), 3),
YN = diff == 3) # %>% arrange(ID, desc(Year)) # to put back in original shape
#> # A tibble: 10 × 7
#> # Groups: ID [2]
#> ID Year Amount cond cumsum diff YN
#> <int> <int> <int> <lgl> <int> <dbl> <lgl>
#> 1 1 2018 -20 TRUE 1 1 FALSE
#> 2 1 2019 -16 TRUE 2 2 FALSE
#> 3 1 2020 -10 TRUE 3 3 TRUE
#> 4 1 2021 -9 TRUE 4 3 TRUE
#> 5 1 2022 2 FALSE 4 2 FALSE
#> 6 2 2018 -500 TRUE 1 1 FALSE
#> 7 2 2019 -600 TRUE 2 2 FALSE
#> 8 2 2020 -200 TRUE 3 3 TRUE
#> 9 2 2021 300 FALSE 3 2 FALSE
#> 10 2 2022 400 FALSE 3 1 FALSE
And an easier solution to read, not generalisable to big spans:
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(YN = (Amount < 0) lead(Amount < 0, default = 0) lead(Amount < 0, 2, default = 0) == 3) %>%
ungroup()
CodePudding user response:
Using the dataset
df <- data.frame(id = c(rep(1,5,),rep(2,5)), year = rep(seq(2022,2018),2), amount = c(2,-9,-10,-16,-20,400,300,-200,-600,-500))
A sapply
function can be used:
df$test <- ifelse(sapply(seq(1,nrow(df)), function(x) all(subset(df, df$id == df$id[x] & df$year %in% seq(df$year[x]-2,df$year[x]))$amount < 0) & nrow(subset(df, df$id == df$id[x] & df$year %in% seq(df$year[x]-2,df$year[x]))) > 2),1,0)
That gives:
df
id year amount test
1 1 2022 2 0
2 1 2021 -9 1
3 1 2020 -10 1
4 1 2019 -16 0
5 1 2018 -20 0
6 2 2022 400 0
7 2 2021 300 0
8 2 2020 -200 1
9 2 2019 -600 0
10 2 2018 -500 0
Breakdown:
ifelse
to change the TRUE/FALSE from the SAPPLY to 1/0
sapply
to iterate per row
all
to test whether the condition holds for all values inside the subset
subset
to select the last three years and the correct id
$amount < 0
to test the condition
CodePudding user response:
You could use data.table package.
library(data.table)
setDT(df)
df[order(Year), YN := match(Reduce(` `, shift(Amount<0, 0:2)), 3, 0), ID]
# ID Year Amount YN
# 1: 1 2022 2 0
# 2: 1 2021 -9 1
# 3: 1 2020 -10 1
# 4: 1 2019 -16 0
# 5: 1 2018 -20 0
# 6: 2 2022 400 0
# 7: 2 2021 300 0
# 8: 2 2020 -200 1
# 9: 2 2019 -600 0
# 10: 2 2018 -500 0
data
df = structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
Year = c(2022, 2021, 2020, 2019, 2018, 2022, 2021, 2020, 2019, 2018),
Amount = c(2, -9, -10, -16, -20, 400, 300, -200, -600, -500)),
row.names = c(NA, -10L),
class = "data.frame")