I have a for loop that I'm trying to run that is quite slow when I apply it to a dataset with 100k observations. What this code does is uses information from one column (df$country
) that describes a country assigned to a particular ID (e.g., ID == 1 and country == Japan), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1.
Sample data (dput()):
structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The code is below:
#Assign vector of column names of my dataframe,
#all named after countries (i.e. "Japan").
cols <- names(df[3:5])
#For each ID, for each column name,
#if ID == j and country == column name,
#Change entry in this row under column name to be unity.
for(j in df$id){
for(c in cols){
df[df$id == j & df$country == c, c] <- 1
}}
The code is way too slow to handle. It's been running for 20 minutes on 100k observations and still has not been completed. Are there any ways I can speed up this process? Thank you!
CodePudding user response:
You could loop over the columns instead of the rows:
for (col in cols) df[[col]] = (df$country == col)
# id country USA Japan Germany
# 1 1 USA 1 0 0
# 2 2 Japan 0 1 0
# 3 3 Germany 0 0 1
# 4 4 Japan 0 1 0
# 5 5 Japan 0 1 0
# 6 6 Germany 0 0 1
Also R has a function (model.matrix
) to do exactly this:
df[levels(factor(df$country))] = model.matrix(~country - 1, df)
CodePudding user response:
You could use pivot_wider
to do it all in one go:
library(tidyverse)
df |>
mutate(value = 1) |>
pivot_wider(id,
names_from = "country",
values_fill = 0) |>
select(-id)
Output:
# A tibble: 6 × 3
USA Japan Germany
<dbl> <dbl> <dbl>
1 1 0 0
2 0 1 0
3 0 0 1
4 0 1 0
5 0 1 0
6 0 0 1
Data:
df <- as.data.frame(structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"))))