thanks in advance for reading my question.
In R I have the following two data frames (Df1 and Df2) - they're examples I have just made up - and I intend to create a dummy variable according to the following rule: for each id in Df1, if the year of this observation id is higher or equal to the observation id of year in Df2, the dummy takes a value of 1, and 0 otherwise. The data frame Df3 is the result I wish to achieve. How can I do it? Df1
id | year | x1 |
---|---|---|
1 | 2017 | 0.3 |
1 | 2018 | 0.5 |
1 | 2019 | 0.45 |
1 | 2020 | 0.5 |
1 | 2021 | 0.6 |
2 | 2018 | 0.2 |
2 | 2019 | 0.3 |
2 | 2020 | 0.4 |
2 | 2021 | 0.5 |
Df2
id | year |
---|---|
1 | 2019 |
2 | 2020 |
Df3
id | year | x1 | dummy |
---|---|---|---|
1 | 2017 | 0.3 | 0 |
1 | 2018 | 0.5 | 0 |
1 | 2019 | 0.45 | 1 |
1 | 2020 | 0.5 | 1 |
1 | 2021 | 0.6 | 1 |
2 | 2018 | 0.5 | 0 |
2 | 2019 | 0.45 | 0 |
2 | 2020 | 0.5 | 1 |
2 | 2021 | 0.6 | 1 |
Some context:
I've tried creating two loops for the real data frame I am working on. Below is the code I have tried. The data frame is called data_school and my id, year, and dummy variables are id_escola, ano, and internet_fixa, respectively. I did a full join between my two initial data frames as a result I got data_school. Since it was a many to one join, I created the dummy variable and only the exact matches have values equal to 1 and everything else is a NA. Then I proceeded to do the following loop first iterating by all id and getting for each id, the year of reference for the dummy, and then iterating for each unique year of that id and replacing it according to the rule. For the first rows it works well, but after some rows, it gets the following error "Error in if (data_school[data_school$id_escola == id & data_school$ano == : argument is of length zero" What should I do?
for (id in unique(data_school$id_escola)) {
current_subset <- subset(data_school, id_escola == id & is.na(internet_fixa) == F)
year_implementation <- current_subset$ano
current <- subset(data_school, id_escola == id)
for (i in unique(current$ano)){
if (data_school[data_school$id_escola == id & data_school$ano == i,]$ano < year_implementation) {
data_school[data_school$id_escola == id & data_school$ano == i, "internet_fixa"] <- 0
} else {
data_school[data_school$id_escola == id & data_school$ano == i, "internet_fixa"] <- 1
}
}
}
P.S.: If you wish, you can ignore the last part (some context) if it is not clear enough.
CodePudding user response:
Does this work:
library(dplyr)
df2 %>% rename('df2_year' = year) %>% left_join(df1, by = 'id') %>% group_by(id) %>% mutate(dummy = if_else(year >= df2_year, 1, 0)) %>% select(-df2_year)
# A tibble: 6 x 4
# Groups: id [2]
id year x1 dummy
<int> <int> <dbl> <dbl>
1 1 2017 0.3 0
2 1 2018 0.5 0
3 1 2019 0.45 1
4 1 2020 0.5 1
5 1 2021 0.6 1
6 2 NA NA NA
Data used:
df1
id year x1
1 1 2017 0.30
2 1 2018 0.50
3 1 2019 0.45
4 1 2020 0.50
5 1 2021 0.60
df2
id year
1 1 2019
2 2 2020
- id = 2 is missing in df1 in your sample data.
CodePudding user response:
We may use a join with data.table
library(data.table)
setDT(df1)[df2, dummy := (year >= i.year), on = .(id)]
-output
> df1
id year x1 dummy
1: 1 2017 0.30 0
2: 1 2018 0.50 0
3: 1 2019 0.45 1
4: 1 2020 0.50 1
5: 1 2021 0.60 1
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L), year = 2017:2021,
x1 = c(0.3, 0.5, 0.45, 0.5, 0.6)), class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(id = 1:2, year = 2019:2020),
class = "data.frame", row.names = c(NA,
-2L))