Home > other >  How to create a dummy in dataframe according to value in another dataframe with a different length o
How to create a dummy in dataframe according to value in another dataframe with a different length o

Time:10-21

thanks in advance for reading my question.

In R I have the following two data frames (Df1 and Df2) - they're examples I have just made up - and I intend to create a dummy variable according to the following rule: for each id in Df1, if the year of this observation id is higher or equal to the observation id of year in Df2, the dummy takes a value of 1, and 0 otherwise. The data frame Df3 is the result I wish to achieve. How can I do it? Df1

id year x1
1 2017 0.3
1 2018 0.5
1 2019 0.45
1 2020 0.5
1 2021 0.6
2 2018 0.2
2 2019 0.3
2 2020 0.4
2 2021 0.5

Df2

id year
1 2019
2 2020

Df3

id year x1 dummy
1 2017 0.3 0
1 2018 0.5 0
1 2019 0.45 1
1 2020 0.5 1
1 2021 0.6 1
2 2018 0.5 0
2 2019 0.45 0
2 2020 0.5 1
2 2021 0.6 1

Some context:

I've tried creating two loops for the real data frame I am working on. Below is the code I have tried. The data frame is called data_school and my id, year, and dummy variables are id_escola, ano, and internet_fixa, respectively. I did a full join between my two initial data frames as a result I got data_school. Since it was a many to one join, I created the dummy variable and only the exact matches have values equal to 1 and everything else is a NA. Then I proceeded to do the following loop first iterating by all id and getting for each id, the year of reference for the dummy, and then iterating for each unique year of that id and replacing it according to the rule. For the first rows it works well, but after some rows, it gets the following error "Error in if (data_school[data_school$id_escola == id & data_school$ano == : argument is of length zero" What should I do?

for (id in unique(data_school$id_escola)) {
  current_subset <- subset(data_school, id_escola == id & is.na(internet_fixa) == F)
  year_implementation <- current_subset$ano
  current <- subset(data_school, id_escola == id)
  for (i in unique(current$ano)){
    if (data_school[data_school$id_escola == id & data_school$ano == i,]$ano < year_implementation) {
      data_school[data_school$id_escola == id & data_school$ano == i, "internet_fixa"] <- 0
    } else {
      data_school[data_school$id_escola == id & data_school$ano == i, "internet_fixa"] <- 1
    }
  }
}

P.S.: If you wish, you can ignore the last part (some context) if it is not clear enough.

CodePudding user response:

Does this work:

library(dplyr)
df2 %>% rename('df2_year' = year) %>% left_join(df1, by = 'id') %>% group_by(id) %>% mutate(dummy = if_else(year >= df2_year, 1, 0)) %>% select(-df2_year)
# A tibble: 6 x 4
# Groups:   id [2]
     id  year    x1 dummy
  <int> <int> <dbl> <dbl>
1     1  2017  0.3      0
2     1  2018  0.5      0
3     1  2019  0.45     1
4     1  2020  0.5      1
5     1  2021  0.6      1
6     2    NA NA       NA

Data used:

df1
  id year   x1
1  1 2017 0.30
2  1 2018 0.50
3  1 2019 0.45
4  1 2020 0.50
5  1 2021 0.60
df2
  id year
1  1 2019
2  2 2020
  • id = 2 is missing in df1 in your sample data.

CodePudding user response:

We may use a join with data.table

library(data.table)
setDT(df1)[df2, dummy :=  (year >= i.year), on = .(id)]

-output

> df1
   id year   x1 dummy
1:  1 2017 0.30     0
2:  1 2018 0.50     0
3:  1 2019 0.45     1
4:  1 2020 0.50     1
5:  1 2021 0.60     1

data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L), year = 2017:2021, 
    x1 = c(0.3, 0.5, 0.45, 0.5, 0.6)), class = "data.frame", row.names = c(NA, 
-5L))

df2 <- structure(list(id = 1:2, year = 2019:2020),
 class = "data.frame", row.names = c(NA, 
-2L))
  • Related