Here is dput() of a structure I currently have.
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))
I want to edit this dataframe to get the below results in order to apply this approach to a dataset with 100k observations. Specifically, I want to use information from (df$country
) that describes a country assigned to a particular ID (e.g., id == 1
and country == Japan
), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1. Note that IDs are not unique!
This is what I'd like to end up with:
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(1, 1, 0, 1, 1), Germany = c(0, 0, 1, 1, 1), Japan = c(1, 1, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))
The following code gives a close result:
df[levels(factor(df$country))] = model.matrix(~country - 1, df)
But ends up giving me the following, erroneous structure:
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",
"Germany", "Germany", "USA"), USA = c(1, 0, 0, 0, 1), Germany = c(0,
0, 1, 1, 0), Japan = c(0, 1, 0, 0, 0)), row.names = c(NA, -5L
), class = "data.frame")
How can I edit the above command in order to yield my desired result? I cannot use pivot because, in actuality, I'm working with many datasets that have different values in the "country" column that, once pivoted, will yield datasets with non-uniform columns/structures, which will impede data analysis later on.
Thank you for any help!
CodePudding user response:
Perhaps this helps
library(dplyr)
df %>%
mutate(across(USA:Japan, ~ (country == cur_column()))) %>%
group_by(id) %>%
mutate(across(USA:Japan, max)) %>%
ungroup
-output
# A tibble: 5 × 5
id country USA Germany Japan
<dbl> <chr> <int> <int> <int>
1 1 USA 1 0 1
2 1 Japan 1 0 1
3 2 Germany 0 1 0
4 4 Germany 1 1 0
5 4 USA 1 1 0
Or modifying the model.matrix
as
m1 <- model.matrix(~country - 1, df)
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)
CodePudding user response:
You can use base R
re <- rle(df$id)
for(j in re$values){
y <- which(j == df$id)
df[y , match(df$country[y] , colnames(df))] <- 1
}
- Output
id country USA Germany Japan
1 1 USA 1 0 1
2 1 Japan 1 0 1
3 2 Germany 0 1 0
4 4 Germany 1 1 0
5 4 USA 1 1 0