I have a dataset with a population variable, as well as a few races ("white", "black", "hispanic"), and I want to be able to loop through the races so that for each race, a "percent_race" variable is created ("percent_white", etc.), and the race variable is then dropped.
I am most familiar with Stata, where you can refer to the current loop value inside the loop by wrapping the macro name in `' (an opening backtick and a closing apostrophe). This lets me name the new variables using a string from my loop, and that same string also indicates which variables to use in the formula for the new variables. Here is what I mean:
local races white black hispanic
foreach race of local races {
    generate `race'_percentage = (`race'/population)*100
    drop `race'
}
In R, I want something to the same effect:
races <- list("white", "black", "hispanic")
df %>%
for (race in races) {
mutate(percent_"race" = (race/population)*100) %>%
select(df, -c(race)) %>%
}
I threw the quotes around race when naming the variable as a filler; I know that doesn't work, but you see how I want the variables to be named.
There might be other things wrong with how I am approaching this in R. I've always done data transformation and analysis in Stata and moved to R for visualization, but I'm trying to learn to do it all in R. I'm not even sure if using a for loop within a pipe is proper here, but it makes sense to me within this little problem I have created for myself.
CodePudding user response:
Your Stata code implies a certain structure of df, namely, that there are separate columns for white, black, and hispanic. In that case, the structure should look something like the sample data I have constructed below, which suggests that you can use mutate(across()) to transform the three variables.
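library(dplyr)

races <- c("white", "black", "hispanic")
df %>%
  # compute each race's share of population, naming the result percent_<race>
  mutate(across(all_of(races), ~ .x * 100 / population, .names = "percent_{.col}")) %>%
  # drop the original count columns
  select(-all_of(races))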
Output:
   population percent_white percent_black percent_hispanic
1       71662     96.303480     0.5288716         3.167648
2       77869     90.231029     4.0503923         5.718579
3       22985     69.071133    12.7996519        18.129215
4       49924     79.546911     7.5454691        12.907620
5       88292      2.462284    14.8699769        82.667739
6       82554     47.779635     7.2485888        44.971776
7       65403     75.846674     5.6297112        18.523615
8       85160     21.641616    36.5124472        41.845937
9       66434     31.819550    18.1352922        50.045158
10      29641     23.163861    65.9154549        10.920684
Input:
library(dplyr)

set.seed(123)
df <- data.frame(population = sample(20000:100000, size = 10)) %>%
  mutate(
    white = ceiling(population * runif(10)),
    black = ceiling((population - white) * runif(10)),
    hispanic = population - white - black
  )
   population white black hispanic
1       71662 69013   379     2270
2       77869 70262  3154     4453
3       22985 15876  2942     4167
4       49924 39713  3767     6444
5       88292  2174 13129    72989
6       82554 39444  5984    37126
7       65403 49606  3682    12115
8       85160 18430 31094    35636
9       66434 21139 12048    33247
10      29641  6866 19538     3237
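If you would rather keep a loop, as in the Stata version, something along these lines may work. This is only a sketch, assuming dplyr 1.0 or later (for glue-style names with :=). The small data frame is made up for illustration, and `result` is just a name picked for this example.

```r
library(dplyr)

# Made-up example data (hypothetical values, for illustration only)
df <- data.frame(
  population = c(100, 200, 300),
  white      = c(50, 75, 100),
  black      = c(25, 50, 150),
  hispanic   = c(25, 75, 50)
)

races <- c("white", "black", "hispanic")

# The loop wraps the pipe, rather than the pipe feeding into the loop
result <- df
for (race in races) {
  result <- result %>%
    # "percent_{race}" is glued into the new column name; := is needed
    # because the left-hand side is a string rather than a bare name.
    # .data[[race]] looks up the column whose name is stored in `race`.
    mutate("percent_{race}" := 100 * .data[[race]] / population) %>%
    # drop the original count column for this race
    select(-all_of(race))
}

result
```

The point of the sketch is that the data frame is reassigned on each pass through the loop, which plays the role that Stata's generate and drop play inside foreach.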
CodePudding user response:
Piping a data frame into a for loop like that is atypical, if not outright disallowed. A more typical, tidy approach is to reshape the data, compute the percentage, and reshape it back:
library(dplyr)  # provides %>%

df <- data.frame(
  id = c('1', '2', '3'),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)
df %>%
  tidyr::pivot_longer(!c(id, population)) %>%
  dplyr::mutate(percent = 100 * value / population) %>%
  tidyr::pivot_wider(
    id_cols = c(id, population),
    names_from = name,
    values_from = percent,
    names_prefix = "percent_"
  )
This code takes the wide data, reshapes it to long (so each 'id/race' combination is unique), calculates the percent, and then goes back to a wide format with the names percent_'race'.
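For comparison, a plain for loop without any pipe is probably the closest analogue to the Stata original. The following is a minimal base R sketch, assuming the same wide layout as the example data above; it builds each new column name with paste0() and uses [[ ]] to read and remove columns by name.

```r
# Same made-up example data as above (illustrative values only)
df <- data.frame(
  id = c("1", "2", "3"),
  population = c(100, 200, 300),
  white = c(50, 75, 100),
  black = c(25, 50, 150),
  hispanic = c(25, 75, 50)
)

races <- c("white", "black", "hispanic")

for (race in races) {
  # build the new column name, e.g. "percent_white"
  new_name <- paste0("percent_", race)
  # df[[race]] fetches the column whose name is stored in `race`
  df[[new_name]] <- 100 * df[[race]] / df$population
  # assigning NULL removes the original count column
  df[[race]] <- NULL
}

df
```

Here df[[race]] does the job of Stata's `race' substitution: the string held in the loop variable selects the column.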