I have a dataframe which contains >100 columns, some are numeric, some not.
All variables ending with "_f" or "_m" are numeric variables and I would like to sum all the pairs that start with the same pattern but end with "_f" or "_m".
Here is an example of variable names in my dataframe:
xxxxxxxxxxxxx_age1_f
xxxxxxxxxxxxx_age1_m
xxxxxxxxxxxxx_age2_f
xxxxxxxxxxxxx_age2_m
xxxxxxxxxxxxx_age3_f
xxxxxxxxxxxxx_age3_m
yyyyyyyyyy_age1_f
yyyyyyyyyy_age1_m
yyyyyyyyyy_age2_f
yyyyyyyyyy_age2_m
yyyyyyyyyy_age3_f
yyyyyyyyyy_age3_m
yyyyyyyyyy_age4_f
yyyyyyyyyy_age4_m
yyyyyyyyyy_age5_f
yyyyyyyyyy_age5_m
zzzzzzzzzzzzzzzzzzzz_age1_f
zzzzzzzzzzzzzzzzzzzz_age1_m
zzzzzzzzzzzzzzzzzzzz_age2_f
zzzzzzzzzzzzzzzzzzzz_age2_m
zzzzzzzzzzzzzzzzzzzz_age3_f
zzzzzzzzzzzzzzzzzzzz_age3_m
text_var_11
text_var_222
text_var_33333
(I'm abstracting the names here with x, y z to make my question clearer, they are not really named like that)
My first solution would be to sum each pair using dplyr::mutate()
like this:
mutate( ... ) %>%
mutate( yyyyyyyyyy_age2 = yyyyyyyyyy_age2_f yyyyyyyyyy_age2_m) %>%
mutate( yyyyyyyyyy_age3 = yyyyyyyyyy_age3_f yyyyyyyyyy_age3_m) %>%
mutate( ... ) %>%
This will work, but there must be more intelligent way to do this without repeating this for all variable pairs.
After looking for a solution, the closest I found was this Sum all columns whose names start with a pattern, by group
However the proposed solution doesn't work in my case for 2 reasons:
- the
substr()
is not applicable to my problem since the lengths of the variable names change - this method assumes I only have variables I want to sum, while in my case I have many other variables that don't end with "_f" or "_m" and don't need to be summed (and cannot be summed since some are text)
I suppose the solution could be modified to apply to my case but I'm not sure how.
EDIT: here is sample data created with dput
structure(list(Groups = c("xx", "xx", "xx"), xxxxx_age0_f = c(8,
0, 7), xxxxx_age0_m = c(5, 0, 0), xxxxx_age1_f = c(1,
0, 0), xxxxx_age1_m = c(3, 2, 0), xxxxx_age2_f = c(0,
0, 2), xxxxx_age2_m = c(0, 1, 0), zzzz_age0_f = c(4,
2, NA), zzzz_age0_m = c(3, 6, NA), zzzz_age1_f = c(0,
0, NA), zzzz_age1_m = c(2, 0, NA), zzzz_age2_f = c(4,
1, NA), zzzz_age2_m = c(3, 1, NA)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
ps: this simplified example has only one categorical variable, while I have hundreds.
CodePudding user response:
Updated, with OP's actual example data:
inner_join(
dat,
dat %>%
select(Groups, ends_with(c("_f", "_m"))) %>%
pivot_longer(cols=!Groups) %>%
mutate(name = gsub("_.$",replacement = "",name)) %>%
group_by(Groups, name) %>%
summarize(value=sum(value,na.rm=T)) %>%
pivot_wider(id_cols ="Groups", "name"),
by="Groups"
)
Previous example, prior to OP's actual example.
If your table looks like this:
dat
# A tibble: 2 x 9
zzzzzzzzzzzzzzzzzzzz_age1_f zzzzzzzzzzzzzzzzzzzz_age1_m zzzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ zzzzzzzzzzzzzzz~ text_var_11 text_var_222 text_var_33333
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 -0.709 1.26 1.03 1.36 -0.140 -0.595 f o x
2 -0.202 0.164 -1.28 -1.48 -0.380 0.874 a p m
Then, you can do this:
dat %>%
select(ends_with(c("_f", "_m"))) %>%
pivot_longer(cols=everything()) %>%
mutate(name = gsub("_.$",replacement = "",name)) %>%
group_by(name) %>%
summarize(value=sum(value,na.rm=T))
To get this:
name value
<chr> <dbl>
1 zzzzzzzzzzzzzzzzzzzz_age1 0.510
2 zzzzzzzzzzzzzzzzzzzz_age2 -0.371
3 zzzzzzzzzzzzzzzzzzzz_age3 -0.240
Now, I don't know what your desired output structure is, but your mutate
attempt above suggests you want to column bind these new aggregate columns. This could be done easily by taking that interim result above, and wrapping it in bind_cols
, like this
bind_cols(
dat,
dat %>%
select(ends_with(c("_f", "_m"))) %>%
pivot_longer(cols=everything()) %>%
mutate(name = gsub("_.$",replacement = "",name)) %>%
group_by(name) %>%
summarize(value=sum(value,na.rm=T)) %>%
pivot_wider(id_cols ="name")
)