Is there a reproducible way of making a table of means from an existing table

I am a beginner at R so please bear with me.

A toy-example of my problem would be somewhat like this -

Fruits	Number
1_Fruit_A	18
2_Fruit_A	20
3_Fruit_A	24
1_Fruit_B	50
2_Fruit_B	63
3_Fruit_B	45
1_Fruit_C	71
2_Fruit_C	75
3_Fruit_C	72

And from this table, I wish to create another table wherein I store data somewhat like this -

Fruits	Means	Standard Deviation
Fruit A - Average	20.67	3
Fruit B - Average	52.67	2
Fruit C - Average	72.67	4

I need to write reproducible code that I can reuse for, say, coloured balls or flowers instead of fruits. It will always compute the mean of triplicates and store it in another table, with one column renamed and one column of means. I will then use ggplot to plot the resulting table. Any help would be greatly appreciated.

CodePudding user response：

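You can extract the common part of the Fruits column, use it as the grouping variable, and then take the average within each group.

The pattern below assumes labels in which the group name is followed by whitespace, a hyphen, and a replicate identifier, and it drops everything from the hyphen onward. For labels such as 1_Fruit_A, strip the leading replicate number instead, e.g. with sub('^[0-9]+_', '', Fruits).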

Using dplyr :

library(dplyr)

df %>%
  group_by(Fruits = sub('\\s-.*', '', Fruits)) %>%
  summarise(Number = mean(Number), .groups = 'drop')

#   Fruits   Number
#1 Fruit A 20.66667
#2 Fruit B 52.66667
#3 Fruit C 72.66667

In base R,

aggregate(Number~Fruits, transform(df, Fruits = sub('\\s-.*', '', Fruits)), mean)
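For labels like 1_Fruit_A, as in the question's table, the same grouping idea can be sketched in base R with the standard deviation included. This is only a minimal sketch: the data frame is rebuilt from the question, and the object names group, means, sds and result are my own choices for illustration.

```r
# Toy data copied from the question
df <- data.frame(
  Fruits = c("1_Fruit_A", "2_Fruit_A", "3_Fruit_A",
             "1_Fruit_B", "2_Fruit_B", "3_Fruit_B",
             "1_Fruit_C", "2_Fruit_C", "3_Fruit_C"),
  Number = c(18, 20, 24, 50, 63, 45, 71, 75, 72)
)

# Strip the leading replicate number and underscore to get the group label
group <- sub("^[0-9]+_", "", df$Fruits)

# Mean and sample standard deviation per group (groups come out sorted)
means <- tapply(df$Number, group, mean)
sds   <- tapply(df$Number, group, sd)

# check.names = FALSE keeps the space in "Standard Deviation"
result <- data.frame(
  Fruits = names(means),
  Means = as.vector(means),
  `Standard Deviation` = as.vector(sds),
  check.names = FALSE
)
result
```

Because only the regular expression knows about the label format, swapping fruits for balls or flowers should not need any other change, as long as the labels keep the number_name pattern.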

CodePudding user response：

Using data.table

library(data.table)

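# dfs is your data frame with columns Fruits and Number.
# Create a new column holding the common string to aggregate on.
# Labels look like "1_Fruit_A", so strip the leading replicate number.
setDT(dfs)[, aggstr := sub('^[0-9]+_', '', Fruits)]
# If the replicate identifier follows the name after a hyphen, use instead:
# setDT(dfs)[, aggstr := sub('\\s-.*', '', Fruits)]

# Mean and standard deviation per group
result.dt <- dfs[, .(Avg = mean(Number), Sd = sd(Number)), by = "aggstr"]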

EDIT: added standard deviation

CodePudding user response：

Very similar to @Ronak Shah's solution, but here I include reproducible data and custom text manipulation of the Fruits column, based on your updated data:

mydf <- data.frame(Fruits = 
                   c("1_Fruit_A",   
                     "2_Fruit_A",   
                     "3_Fruit_A",   
                     "1_Fruit_B",   
                     "2_Fruit_B",   
                     "3_Fruit_B",   
                     "1_Fruit_C",   
                     "2_Fruit_C", 
                     "3_Fruit_C"), 
                  Number = c(18, 20, 24, 50, 63, 45, 71, 75, 72))


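library(dplyr)

# Drop the leading replicate number and underscore, turn the remaining
# underscores into spaces, then append the " - Average" suffix
mydf$Fruits <- mydf %>% 
  pull(Fruits) %>% 
  gsub("^[[:digit:]]+[_]", "", .) %>% 
  gsub("[_]", " ", .) %>% 
  paste0(" - Average")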

# Intermediate result
mydf
#              Fruits Number
# 1 Fruit A - Average     18
# 2 Fruit A - Average     20
# 3 Fruit A - Average     24
# 4 Fruit B - Average     50
# 5 Fruit B - Average     63
# 6 Fruit B - Average     45
# 7 Fruit C - Average     71
# 8 Fruit C - Average     75
# 9 Fruit C - Average     72


newdf <- mydf %>% group_by(Fruits) %>% summarise(Means = mean(Number), 
                                        `Standard Deviation` =  sd(Number))

# Final result

newdf
# # A tibble: 3 x 3
#    Fruits            Means  `Standard Deviation`
#     <chr>             <dbl>                <dbl>
# 1 Fruit A - Average  20.7                 3.06
# 2 Fruit B - Average  52.7                 9.29
# 3 Fruit C - Average  72.7                 2.08

I think this code serves your purpose, but I strongly recommend removing the word "Average" from the Fruits column because it is misleading. It implies that the values in the Means column are not the means of each group's values, but the means of each group's averages.
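Since you asked for something reusable that feeds into ggplot, here is one way the steps above could be wrapped up. Treat it as a rough sketch rather than a finished solution: summarise_triplicates, its arguments, and the Group and SD column names are hypothetical names I made up, and it assumes every label starts with a replicate number followed by an underscore.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: label_col and value_col are column names given as strings.
# Assumes labels of the form "<replicate number>_<group name>", e.g. "1_Fruit_A".
summarise_triplicates <- function(data, label_col, value_col) {
  data %>%
    mutate(Group = gsub("_", " ", sub("^[0-9]+_", "", .data[[label_col]]))) %>%
    group_by(Group) %>%
    summarise(Means = mean(.data[[value_col]]),
              SD = sd(.data[[value_col]]),
              .groups = "drop")
}

# Toy data from the question
mydf <- data.frame(
  Fruits = c("1_Fruit_A", "2_Fruit_A", "3_Fruit_A",
             "1_Fruit_B", "2_Fruit_B", "3_Fruit_B",
             "1_Fruit_C", "2_Fruit_C", "3_Fruit_C"),
  Number = c(18, 20, 24, 50, 63, 45, 71, 75, 72)
)

newdf <- summarise_triplicates(mydf, "Fruits", "Number")

# Bar chart of the group means with error bars of one standard deviation
p <- ggplot(newdf, aes(x = Group, y = Means)) +
  geom_col() +
  geom_errorbar(aes(ymin = Means - SD, ymax = Means + SD), width = 0.2)
p
```

For balls or flowers you would call the same function with a different data frame and column names, e.g. summarise_triplicates(balls, "Balls", "Count"), where balls is whatever your table is called.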