using mutate within a for loop (new variable name contains strings I am looping through)

Time:07-20

I have a dataset with a population variable, as well as a few races ("white", "black", "hispanic"), and I want to be able to loop through the races so that for each race, a "percent_race" variable is created ("percent_white", etc.), and the race variable is then dropped.

I am most familiar with Stata, where you can reference the current loop string inside the loop by wrapping the macro name in a backtick and an apostrophe (`race'). This lets me name the new variables using a string from my loop, and the same string indicates which variables to use in the formula for those new variables. Here is what I mean:

loc races white black hispanic

foreach race of local races {
   generate `race'_percentage = (population/`race')*100
   drop `race'
   }

In R, I want something to the same effect:

races <- list("white", "black", "hispanic")

df %>%
   for (race in races) {
      mutate(percent_"race" = (population/race)*100) %>%
      select(df, -c(race)) %>%
      }

I threw the quotes around race when naming the variable as a filler; I know that doesn't work, but you see how I want the variables to be named.

There might be other things wrong with how I am approaching this in R. I've always done data transformation and analysis in Stata and moved to R for visualization, but I'm trying to learn to do it all in R. I'm not even sure whether using a for loop within a pipe is proper here, but it makes sense to me for this little problem I have created for myself.

CodePudding user response:

Your Stata code implies a certain structure for df, namely that there are separate columns for white, black, and hispanic. In that case, the data should look something like the sample I have constructed below, and you can use mutate(across()) to transform the three variables.

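library(dplyr)

races <- c("white", "black", "hispanic")
df %>%
  # for each race column, compute its share of population as a percentage
  mutate(across(all_of(races), ~ .x * 100 / population, .names = "percent_{.col}")) %>%
  # drop the original count columns
  select(-all_of(races))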

Output:

   population percent_white percent_black percent_hispanic
1       71662     96.303480     0.5288716         3.167648
2       77869     90.231029     4.0503923         5.718579
3       22985     69.071133    12.7996519        18.129215
4       49924     79.546911     7.5454691        12.907620
5       88292      2.462284    14.8699769        82.667739
6       82554     47.779635     7.2485888        44.971776
7       65403     75.846674     5.6297112        18.523615
8       85160     21.641616    36.5124472        41.845937
9       66434     31.819550    18.1352922        50.045158
10      29641     23.163861    65.9154549        10.920684

Input:

set.seed(123)
df = data.frame(population=sample(20000:100000, size = 10)) %>% 
  mutate(
    white = ceiling(population*runif(10)),
    black = ceiling((population-white)*runif(10)),
    hispanic = population-white-black
)

   population white black hispanic
1       71662 69013   379     2270
2       77869 70262  3154     4453
3       22985 15876  2942     4167
4       49924 39713  3767     6444
5       88292  2174 13129    72989
6       82554 39444  5984    37126
7       65403 49606  3682    12115
8       85160 18430 31094    35636
9       66434 21139 12048    33247
10      29641  6866 19538     3237
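
For comparison with the Stata loop in the question, the same result can also be sketched with an explicit for loop that updates the data frame on each pass instead of piping into the loop. This is a minimal base R sketch, not part of the across() approach above; the small data frame and the name result are made up for illustration.

```r
# Hypothetical toy data, used only for illustration
df <- data.frame(
  population = c(100, 200),
  white      = c(50, 80),
  black      = c(30, 60),
  hispanic   = c(20, 60)
)

races <- c("white", "black", "hispanic")

result <- df
for (race in races) {
  # build the new column name from the loop string, e.g. "percent_white"
  new_name <- paste0("percent_", race)
  # [[ ]] looks up the column whose name is stored in `race`
  result[[new_name]] <- result[[race]] * 100 / result$population
  # drop the original count column
  result[[race]] <- NULL
}

result
```

Here paste0() plays the role of Stata's `race' substitution when naming the new column, and [[race]] plays that role when reading the existing one.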

CodePudding user response:

It's atypical, if not outright disallowed, to pipe a data frame into a for loop like that. A more typical, tidy approach is to reshape the data, compute the percentage, and reshape back:

df <- data.frame(
  id = c('1', '2', '3'),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)

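library(dplyr)  # provides %>%

df %>%
  tidyr::pivot_longer(!c(id, population)) %>%
  # multiply by 100 so the result is a percentage rather than a proportion
  dplyr::mutate(percent = value / population * 100) %>%
  # name id_cols and values_from so the computed percent column is what gets spread
  tidyr::pivot_wider(id_cols = c(id, population), names_from = name,
                     values_from = percent, names_prefix = "percent_")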

This code takes the wide data, reshapes it to long (so each 'id/race' combination is unique), calculates the percent, and then goes back to a wide format with the names percent_'race'.
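
To make the reshaping step concrete, the sketch below stops after the long-format stage so the intermediate table can be inspected. It reuses the sample data from this answer; the name long is made up for illustration, and the ratio is multiplied by 100 here so the values read as percentages.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id = c("1", "2", "3"),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)

# One row per id/race pair: `name` holds the race, `value` holds the count
long <- df %>%
  pivot_longer(!c(id, population)) %>%
  mutate(percent = value / population * 100)

long
```

The long table has nine rows (three ids times three races) with columns id, population, name, value, and percent; pivoting wider on name then turns each race back into its own column.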