Home > Software design >  Assigning column values in for loops -- too slow
Assigning column values in for loops -- too slow

Time:07-08

I have a for loop that I'm trying to run that is quite slow when I apply it to a dataset with 100k observations. What this code does is uses information from one column (df$country) that describes a country assigned to a particular ID (e.g., ID == 1 and country == Japan), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1.

Sample data (dput()):

structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan",  "Germany", "Japan", "Japan", "Germany"), USA = c(0, 0, 0, 0,  0, 0), Japan = c(0, 0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0,  0)), row.names = c(NA, 6L), class = "data.frame")

The code is below:

#Assign vector of column names of my dataframe, 
#all named after countries (i.e. "Japan"). 
cols <- names(df[3:5]) 

#For each ID, for each column name,
#if ID == j and country == column name,
#Change entry in this row under column name to be unity.

  for(j in df$id){
    for(c in cols){
      df[df$id == j & df$country == c, c] <- 1
    }}

The code is way too slow to handle. It's been running for 20 minutes on 100k observations and still has not been completed. Are there any ways I can speed up this process? Thank you!

CodePudding user response:

You could loop over the columns instead of the rows:

for (col in cols) df[[col]] =  (df$country == col)

#   id country USA Japan Germany
# 1  1     USA   1     0       0
# 2  2   Japan   0     1       0
# 3  3 Germany   0     0       1
# 4  4   Japan   0     1       0
# 5  5   Japan   0     1       0
# 6  6 Germany   0     0       1

Also R has a function (model.matrix) to do exactly this:

df[levels(factor(df$country))] = model.matrix(~country - 1, df)

CodePudding user response:

You could use pivot_wider to do it all in one go:

library(tidyverse)

df |>
  mutate(value = 1) |>
  pivot_wider(id,
              names_from = "country",
              values_fill = 0) |>
  select(-id)

Output:

# A tibble: 6 × 3
    USA Japan Germany
  <dbl> <dbl>   <dbl>
1     1     0       0
2     0     1       0
3     0     0       1
4     0     1       0
5     0     1       0
6     0     0       1

Data:

df <- as.data.frame(structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("USA", "Japan", "Germany", "Japan", "Japan", "Germany"))))
  • Related