I would like to add two columns to my data frame: one with the number of non-NA values per row within a subset of columns, and the other with the average of those values. Suppose I have the following data frame:
set.seed(123)
df <- data.frame(id = 1:5,
                 var1_a = runif(5),
                 var1_b = runif(5),
                 var2_a = runif(5),
                 var2_b = runif(5))
df$var1_a[df$var1_a < 0.5] <- NA
df$var2_b[df$var2_b > 0.5] <- NA
This looks like:
  id    var1_a    var1_b    var2_a     var2_b
1  1        NA 0.0455565 0.9568333         NA
2  2 0.7883051 0.5281055 0.4533342 0.24608773
3  3        NA 0.8924190 0.6775706 0.04205953
4  4 0.8830174 0.5514350 0.5726334 0.32792072
5  5 0.9404673 0.4566147 0.1029247         NA
I would like to create these new columns for two subsets of columns, identified by the prefixes var1 and var2. That gives 4 new columns: var1_count and var2_count, which count the non-NA cells per row in the var1 and var2 subsets, and avg_var1 and avg_var2, which hold the per-row average of the var1 and var2 columns.
I found a similar problem in the following link:
Sum columns with similar names/prefixes in R
I tried to follow some solutions from this post:
list2DF(
tapply(
as.list(df),
gsub("\\..*", "", names(df)),
function(x) rowSums(list2DF(x))
)
)
I also tried:
cbind(df, sapply(split.default(df,
sub("\\..*", "", names(df))), rowSums))
I guess I don't understand what the \\..* means, so I am probably getting the prefixes wrong. Also, this is just an example: my actual variables are not called var1 or var2 but have different names. For instance, two of them are called "brand_m1_A" and "Brand_m1_B". Does that change the solution?
Answer:
We could convert to 'long' format with pivot_longer, do a group-by summarise to return the mean and the count of non-NA elements, and join the result with the original data. In pivot_longer, the names_pattern captures the characters before the _ (the (...) group) and drops the _ and everything after it, so the column names reduce to 'var1' and 'var2'.
library(dplyr)
library(tidyr)

df %>%
  # one row per id and suffix; the captured prefix (var1, var2) becomes the column name
  pivot_longer(cols = starts_with("var"),
               names_to = ".value", names_pattern = "(.*)_.*") %>%
  group_by(id) %>%
  summarise(across(everything(),
                   list(mean  = ~ mean(.x, na.rm = TRUE),
                        count = ~ sum(!is.na(.x)))),
            .groups = "drop") %>%
  # attach the per-row summaries to the original columns
  left_join(df, ., by = "id")
-output
  id    var1_a    var1_b    var2_a     var2_b var1_mean var1_count var2_mean var2_count
1  1        NA 0.0455565 0.9568333         NA 0.0455565          1 0.9568333          1
2  2 0.7883051 0.5281055 0.4533342 0.24608773 0.6582053          2 0.3497109          2
3  3        NA 0.8924190 0.6775706 0.04205953 0.8924190          1 0.3598151          2
4  4 0.8830174 0.5514350 0.5726334 0.32792072 0.7172262          2 0.4502771          2
5  5 0.9404673 0.4566147 0.1029247         NA 0.6985410          2 0.1029247          1
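To see what that names_pattern keeps, the same regex can be tried on the column names with base sub(). This is only a quick check, not part of the solution: it assumes the base regex engine treats this simple pattern the same way pivot_longer does, and the two "brand" names are the ones mentioned in the question.

```r
# Column names from the example plus the two names given in the question
nms <- c("var1_a", "var1_b", "var2_a", "var2_b", "brand_m1_A", "Brand_m1_B")

# Same pattern as names_pattern above: the greedy (.*) keeps everything
# before the LAST underscore, and the suffix after it is dropped
prefix <- sub("(.*)_.*", "\\1", nms)
print(prefix)

# Assumption: prefixes that differ only in letter case belong to one group
print(unique(tolower(prefix)))
```

So names with more than one underscore still reduce to a sensible prefix, but "brand_m1" and "Brand_m1" would stay separate unless the column names are brought to a common case first, and starts_with("var") would have to be replaced by a selection that matches the real names.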
With split.default, we need the regex _.* to remove everything from the _ onward. In addition, select only the numeric columns, i.e. remove the first 'id' column ([-1]).
cbind(df,
      lapply(split.default(df[-1], sub("_.*", "", names(df[-1]))),
             \(x) data.frame(mean  = rowMeans(x, na.rm = TRUE),
                             count = rowSums(!is.na(x)))))
-output
  id    var1_a    var1_b    var2_a     var2_b var1.mean var1.count var2.mean var2.count
1  1        NA 0.0455565 0.9568333         NA 0.0455565          1 0.9568333          1
2  2 0.7883051 0.5281055 0.4533342 0.24608773 0.6582053          2 0.3497109          2
3  3        NA 0.8924190 0.6775706 0.04205953 0.8924190          1 0.3598151          2
4  4 0.8830174 0.5514350 0.5726334 0.32792072 0.7172262          2 0.4502771          2
5  5 0.9404673 0.4566147 0.1029247         NA 0.6985410          2 0.1029247          1
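As for the \\..* in the question: it matches a literal dot followed by anything, and none of these names contains a dot, so nothing is stripped and every column ends up in its own group. The sketch below compares the patterns and then shows one possible way to handle names like "brand_m1_A" and "Brand_m1_B". The helper add_group_stats and the toy data are made up for illustration, and it assumes the suffix is always the part after the last underscore and that letter case should be ignored when grouping.

```r
nms <- c("var1_a", "var1_b", "var2_a", "var2_b")

# "\\..*": literal dot plus the rest; no dot in these names, so nothing changes
dot_result <- sub("\\..*", "", nms)
# "_.*": the FIRST underscore plus the rest is removed
underscore_result <- sub("_.*", "", nms)

real <- c("brand_m1_A", "Brand_m1_B")
too_short <- sub("_.*", "", real)               # cuts at the first underscore
last_cut  <- tolower(sub("_[^_]*$", "", real))  # cuts at the last underscore

# Hypothetical helper: adds <prefix>_count and <prefix>_avg for each group
add_group_stats <- function(data, cols) {
  groups <- tolower(sub("_[^_]*$", "", cols))
  for (g in unique(groups)) {
    block <- data[cols[groups == g]]
    data[[paste0(g, "_count")]] <- rowSums(!is.na(block))
    data[[paste0(g, "_avg")]]   <- rowMeans(block, na.rm = TRUE)
  }
  data
}

# Made-up toy data using the two names from the question
toy <- data.frame(id = 1:3,
                  brand_m1_A = c(1, NA, 3),
                  Brand_m1_B = c(3, NA, NA))
res <- add_group_stats(toy, setdiff(names(toy), "id"))
print(res)
```

Note that a row where every value in the group is NA gets a count of 0 and an average of NaN, since rowMeans(..., na.rm = TRUE) has nothing left to average.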