Home > database >  Changing values of many columns at once -- model.matrix()?
Changing values of many columns at once -- model.matrix()?

Time:07-08

Here is dput() of a structure I currently have.

structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",  "Germany", "Germany", "USA"), USA = c(0, 0, 0, 0, 0), Germany = c(0,  0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA,  -5L))

I want to edit this dataframe to get the below results in order to apply this approach to a dataset with 100k observations. Specifically, I want to use information from (df$country) that describes a country assigned to a particular ID (e.g., id == 1 and country == Japan), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1. Note that IDs are not unique!

This is what I'd like to end up with:

structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",  "Germany", "Germany", "USA"), USA = c(1, 1, 0, 1, 1), Germany = c(0,  0, 1, 1, 1), Japan = c(1, 1, 0, 0, 0)), class = "data.frame", row.names = c(NA,  -5L))

The following code gives a close result:

df[levels(factor(df$country))] = model.matrix(~country - 1, df)

But ends up giving me the following, erroneous structure:

structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", 
"Germany", "Germany", "USA"), USA = c(1, 0, 0, 0, 1), Germany = c(0, 
0, 1, 1, 0), Japan = c(0, 1, 0, 0, 0)), row.names = c(NA, -5L
), class = "data.frame") 

How can I edit the above command in order to yield my desired result? I cannot use pivot because, in actuality, I'm working with many datasets that have different values in the "country" column that, once pivoted, will yield datasets with non-uniform columns/structures, which will impede data analysis later on.

Thank you for any help!

CodePudding user response:

Perhaps this helps

library(dplyr)
df %>% 
    mutate(across(USA:Japan, ~   (country == cur_column()))) %>% 
    group_by(id) %>% 
    mutate(across(USA:Japan, max)) %>% 
    ungroup

-output

# A tibble: 5 × 5
     id country   USA Germany Japan
  <dbl> <chr>   <int>   <int> <int>
1     1 USA         1       0     1
2     1 Japan       1       0     1
3     2 Germany     0       1     0
4     4 Germany     1       1     0
5     4 USA         1       1     0

Or modifying the model.matrix as

m1 <- model.matrix(~country - 1, df)
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)

CodePudding user response:

You can use base R

re <- rle(df$id)
for(j in re$values){
    y <- which(j == df$id)
        df[y , match(df$country[y] , colnames(df))] <- 1
}
  • Output
  id country USA Germany Japan
1  1     USA   1       0     1
2  1   Japan   1       0     1
3  2 Germany   0       1     0
4  4 Germany   1       1     0
5  4     USA   1       1     0
  • Related