Home > Back-end >  Column sum changing name of variables in r
Column sum changing name of variables in r

Time:10-22

I´m trying to do a column sum using apply and I would like to know if you could give me some advice to understand the best way to repete the same opperation but changing the names of the variables without having to write several lines.

I have a dataframe that looks like this but with several more variables:

a<- 100:300
b<- 50:250
c<- 200:400
d<- 0:200
e<- 300:500
df<-tibble(a,a*2, a/2, b,b*2, b/2, c,c*2, c/2, d,d*2, d/2,e,e*2, e/2)
colnames(df)<-c("A_2000", "A_2001", "A_2002", "B_2000","B_2001", "B_2002",
                "C_2000","C_2001", "C_2002", "D_2000","D_2001", "D_2002",
                "E_2000","E_2001", "E_2002")

This is the command I´m using to sum one set of columns:

df$A<-apply(cbind(df$A_2000,df$B_2001,df$C_2002), 1, sum)

So, my question is : is there a way to do the same for columns B, C, D and E without writing the same command 4 times? I appreciate in this case writing 4 lines is not that bad, but I have to do this for 50 different sets of variables. Appreciate the help.

CodePudding user response:

We could split the data into list of data.frame based on the substring of column name i.e. removing the _ followed by digits (\\d ) and then loop over the list and apply rowSums

df <- cbind(df, sapply(split.default(df, trimws(names(df),
          whitespace = "_\\d ")), rowSums, na.rm = TRUE))

-output

head(df)
  A_2000 A_2001 A_2002 B_2000 B_2001 B_2002 C_2000 C_2001 C_2002 D_2000 D_2001 D_2002 E_2000 E_2001 E_2002      A     B      C     D      E
1      100    200   50.0     50    100   25.0    200    400  100.0      0      0    0.0    300    600  150.0  350.0 175.0  700.0   0.0 1050.0
2      101    202   50.5     51    102   25.5    201    402  100.5      1      2    0.5    301    602  150.5  353.5 178.5  703.5   3.5 1053.5
3      102    204   51.0     52    104   26.0    202    404  101.0      2      4    1.0    302    604  151.0  357.0 182.0  707.0   7.0 1057.0
4      103    206   51.5     53    106   26.5    203    406  101.5      3      6    1.5    303    606  151.5  360.5 185.5  710.5  10.5 1060.5
5      104    208   52.0     54    108   27.0    204    408  102.0      4      8    2.0    304    608  152.0  364.0 189.0  714.0  14.0 1064.0

CodePudding user response:

Thank you Akrun. That works nicely. I don´t think I fully understand what "whitespace" is doing. For example, what if I need an additional level of split for the column names. Let´s say I have a column with the number of deaths by cancer, dementia and other, by year and sex. So my variable names are:

cancer_male_2000, dementia_male_2000,other_male_2000, cancer_male_2001, dementia_male_2001,other_male_2001, cancer_male_2002, dementia_male_2002,other_male_2002, ...(same for female)

And I need to calculate the RowSum of all deaths (cancer dementia other) for each sex and year. How would you split the column names?

  • Related