How to compute the mean for certain category rows?

Assuming this data:

library(tibble)

df <- tibble(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i = seq_along(word),
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

I want to calculate the mean of column i, per year, for all words in column word except A and banana, and label those pooled rows out. The rows for A and banana should stay as they are. The expected output would be:

  word       i  year
  <chr>  <dbl> <dbl>
1 out      2.5  2000
2 out      3.5  2001
3 banana   3    2000
4 A        6    2000
5 A        7    2001
6 A        8    2002

CodePudding user response：

We can group by 'year' and by 'word', after replacing every element that is not 'A' or 'banana' with 'out', then take the mean of 'i' and, if necessary, order the rows by 'word' with arrange():

library(dplyr)
df %>% 
  group_by(year, word = replace(word, !(word %in% c('A', 'banana')), 'out')) %>%
  summarise(i = mean(i), .groups = 'drop') %>%
  arrange(factor(word, levels = c('out', 'banana', 'A'))) %>%
  select(names(df))

CodePudding user response：

First we replace every word that is not banana or A with out using ifelse(), then group by year and word and summarise. Finally we arrange the rows:

library(dplyr)
df %>% 
  mutate(word = ifelse(word != "banana" & word !="A", "out", word)) %>% 
  group_by(year, word) %>% 
  summarise(mean_i = mean(i)) %>% 
  arrange(desc(word))

   year word   mean_i
  <dbl> <chr>   <dbl>
1  2000 out       2.5
2  2001 out       3.5
3  2000 banana    3  
4  2000 A         6  
5  2001 A         7  
6  2002 A         8
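
The result above is still grouped by year, because summarise() drops only the last grouping level by default. As a minimal sketch of the same idea, assuming dplyr 1.1.0 or later (where the .by argument is available), the grouping can be limited to the summarise() call so that an ungrouped tibble comes back. The data are rebuilt so the snippet runs on its own, and the name res is only illustrative:

```r
library(dplyr)

# Rebuild the example data from the question
df <- tibble(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i = seq_along(word),
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

# Relabel everything except 'A' and 'banana' as 'out', then average i
# per year and word; .by groups for this call only (assumes dplyr >= 1.1.0)
res <- df %>%
  mutate(word = if_else(word %in% c("A", "banana"), word, "out")) %>%
  summarise(mean_i = mean(i), .by = c(year, word))

res
```

The pooled out rows should come out as 2.5 for 2000 and 3.5 for 2001, matching the table above.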

CodePudding user response：

Using data.table:

library(data.table)
dt <- as.data.table(df)

# assign name "mean" to the new column with average (mean) values
> dt[!(word %in% c('banana', 'A') ), .(mean = mean(i)), by = .(year, word)]
   year  word mean
1: 2000 apple    1
2: 2001 apple    2
3: 2000  pear    4
4: 2001  pear    5
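
The call above drops 'banana' and 'A' and keeps apple and pear as separate groups, so it does not yet give the pooled out rows from the question. One possible sketch relabels the words inside by instead of filtering. It assumes a data.table version that provides fifelse() (1.12.4 or later); the names keep and res are illustrative only:

```r
library(data.table)

# Rebuild the example data from the question
dt <- data.table(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

# Words that keep their own label; everything else is pooled as 'out'
keep <- c("A", "banana")

# Group by year and the relabelled word, then average i within each group
res <- dt[, .(i = mean(i)),
          by = .(year, word = fifelse(word %in% keep, word, "out"))]

res
```

Because the grouping expression is computed on the fly, the original dt is left unchanged.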

CodePudding user response：

Base R solution

We first split the data by word and, within word, by year, then compute the mean of i for each piece. The means for all words except 'A' and 'apple' (names starting with 'A' or 'a') are stored in out. Finally, we build a data.frame with columns for word, year, and the means, using regular expressions to recover word and year from the names of out.

means <- sapply(split(df, f = ~ df$word + df$year), \(x) mean(x$i))
out <- means[!grepl('^[Aa]', names(means))]
df_new <- data.frame(
  word=gsub('\\.[0-9]*', '', names(out)),
  year=gsub('[^0-9]', '', names(out)),
  mean_i=out
)
rownames(df_new) <- NULL

Output

> df_new
    word year mean_i
1 banana 2000      3
2   pear 2000      4
3 banana 2001    NaN
4   pear 2001      5
5 banana 2002    NaN
6   pear 2002    NaN

If you want to omit rows containing missing values of mean_i, then you could use something like df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]

> df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]
    word year mean_i
1 banana 2000      3
2   pear 2000      4
4   pear 2001      5

Note: use function(x) instead of \(x) if you use a version of R <4.1.0
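
The code above keeps banana and pear as separate words. For the pooled out rows asked for in the question, a rough base R sketch could relabel the words first and then call aggregate(). The names df2 and res are illustrative only, and stringsAsFactors = FALSE is set explicitly so the relabelling also works on R versions before 4.0.0:

```r
# Rebuild the example data from the question as a plain data.frame
df <- data.frame(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002),
  stringsAsFactors = FALSE
)

# Relabel everything except 'A' and 'banana' as 'out'
df2 <- transform(df, word = ifelse(word %in% c("A", "banana"), word, "out"))

# Mean of i for each combination of word and year that occurs in the data
res <- aggregate(i ~ word + year, data = df2, FUN = mean)

res
```

aggregate() returns only the combinations that actually occur, so no NaN rows need to be removed afterwards. The rows come back sorted by the grouping variables, so an extra order() step would be needed if the out rows should be listed first.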

CodePudding user response：

Do you mean something like:

df %>% 
  filter(!(word %in% c('A', 'banana'))) %>%
  group_by(year, word) %>%
  summarize(Mean =mean(i))

Output:

# A tibble: 4 × 3
# Groups:   year [2]
   year word   Mean
  <dbl> <chr> <dbl>
1  2000 apple     1
2  2000 pear      4
3  2001 apple     2
4  2001 pear      5

Update: Inspired by akrun I slightly improved the filter.