I´m trying to do a column sum using apply and I would like to know if you could give me some advice to understand the best way to repete the same opperation but changing the names of the variables without having to write several lines.
I have a dataframe that looks like this but with several more variables:
a<- 100:300
b<- 50:250
c<- 200:400
d<- 0:200
e<- 300:500
df<-tibble(a,a*2, a/2, b,b*2, b/2, c,c*2, c/2, d,d*2, d/2,e,e*2, e/2)
colnames(df)<-c("A_2000", "A_2001", "A_2002", "B_2000","B_2001", "B_2002",
"C_2000","C_2001", "C_2002", "D_2000","D_2001", "D_2002",
"E_2000","E_2001", "E_2002")
This is the command I´m using to sum one set of columns:
df$A<-apply(cbind(df$A_2000,df$B_2001,df$C_2002), 1, sum)
So, my question is : is there a way to do the same for columns B, C, D and E without writing the same command 4 times? I appreciate in this case writing 4 lines is not that bad, but I have to do this for 50 different sets of variables. Appreciate the help.
CodePudding user response:
We could split the data into list
of data.frame based on the substring of column name i.e. removing the _
followed by digits (\\d
) and then loop over the list and apply rowSums
df <- cbind(df, sapply(split.default(df, trimws(names(df),
whitespace = "_\\d ")), rowSums, na.rm = TRUE))
-output
head(df)
A_2000 A_2001 A_2002 B_2000 B_2001 B_2002 C_2000 C_2001 C_2002 D_2000 D_2001 D_2002 E_2000 E_2001 E_2002 A B C D E
1 100 200 50.0 50 100 25.0 200 400 100.0 0 0 0.0 300 600 150.0 350.0 175.0 700.0 0.0 1050.0
2 101 202 50.5 51 102 25.5 201 402 100.5 1 2 0.5 301 602 150.5 353.5 178.5 703.5 3.5 1053.5
3 102 204 51.0 52 104 26.0 202 404 101.0 2 4 1.0 302 604 151.0 357.0 182.0 707.0 7.0 1057.0
4 103 206 51.5 53 106 26.5 203 406 101.5 3 6 1.5 303 606 151.5 360.5 185.5 710.5 10.5 1060.5
5 104 208 52.0 54 108 27.0 204 408 102.0 4 8 2.0 304 608 152.0 364.0 189.0 714.0 14.0 1064.0
CodePudding user response:
Thank you Akrun. That works nicely. I don´t think I fully understand what "whitespace" is doing. For example, what if I need an additional level of split for the column names. Let´s say I have a column with the number of deaths by cancer, dementia and other, by year and sex. So my variable names are:
cancer_male_2000, dementia_male_2000,other_male_2000, cancer_male_2001, dementia_male_2001,other_male_2001, cancer_male_2002, dementia_male_2002,other_male_2002, ...(same for female)
And I need to calculate the RowSum of all deaths (cancer dementia other) for each sex and year. How would you split the column names?