Home > OS >  Is there a reproducible way of making a table of means from an existing table
Is there a reproducible way of making a table of means from an existing table

Time:03-18

I am a beginner at R so please bear with me.

A toy-example of my problem would be somewhat like this -

Fruits Number
1_Fruit_A 18
2_Fruit_A 20
3_Fruit_A 24
1_Fruit_B 50
2_Fruit_B 63
3_Fruit_B 45
1_Fruit_C 71
2_Fruit_C 75
3_Fruit_C 72

And from this table, I wish to create another table wherein I store data somewhat like this -

Fruits Means Standard Deviation
Fruit A - Average 20.67 3
Fruit B - Average 52.67 2
Fruit C - Average 72.67 4

I need to write a reproducible code which I can use for say even coloured balls or flowers instead of fruits, but always a mean of triplicates which needs to be stored on another table with one column renamed and one column of means, I will then use ggplot to plot the resultant table. Any help would be greatly appreciated.

CodePudding user response:

You can extract the common value from Fruits column and use it as group and then take average value in each group.

For your example, you can drop everything after a -.

Using dplyr :

library(dplyr)

df %>%
  group_by(Fruits = sub('\\s-.*', '', Fruits)) %>%
  summarise(Number = mean(Number), .groups = 'drop')

#   Fruits   Number
#1 Fruit A 20.66667
#2 Fruit B 52.66667
#3 Fruit C 72.66667

In base R,

aggregate(Number~Fruits, transform(df, Fruits = sub('\\s-.*', '', Fruits)), mean)

CodePudding user response:

Using data.table

library(data.table)

# create a new column which includes common string to aggregate upon

setDT(dfs)[,':='(
  aggstr = substr(Fruits, 0, nchar(Fruits)-1)
  # OR using regex 
  # aggstr = sub('\\s-.*', '', Fruits)
)]
result.dt <- dfs[,.(Avg = mean(Number) , Sd= sd(Number)), by="aggstr"]

EDIT: added standard deviation

CodePudding user response:

Very similar to @Ronak Shah's solution, but here I include a reproducible data and customized text manipulation on Fruits column, based on your updated data:

mydf <- data.frame(Fruits = 
                   c("1_Fruit_A",   
                     "2_Fruit_A",   
                     "3_Fruit_A",   
                     "1_Fruit_B",   
                     "2_Fruit_B",   
                     "3_Fruit_B",   
                     "1_Fruit_C",   
                     "2_Fruit_C", 
                     "3_Fruit_C"), 
                  Number = c(18, 20, 24, 50, 63, 45, 71, 75, 72))


mydf$Fruits <- mydf %>% 
  pull(Fruits) %>% 
  gsub("^[[:digit:]] [_]", "", .) %>% 
  gsub("[_]", " ", .) %>% paste0(" - Average")

# Intermediate result
mydf
#              Fruits Number
# 1 Fruit A - Average     18
# 2 Fruit A - Average     20
# 3 Fruit A - Average     24
# 4 Fruit B - Average     50
# 5 Fruit B - Average     63
# 6 Fruit B - Average     45
# 7 Fruit C - Average     71
# 8 Fruit C - Average     75
# 9 Fruit C - Average     72


newdf <- mydf %>% group_by(Fruits) %>% summarise(Means = mean(Number), 
                                        `Standard Deviation` =  sd(Number))

# Final result

newdf
# # A tibble: 3 x 3
#    Fruits            Means  `Standard Deviation`
#     <chr>             <dbl>                <dbl>
# 1 Fruit A - Average  20.7                 3.06
# 2 Fruit B - Average  52.7                 9.29
# 3 Fruit C - Average  72.7                 2.08

I think this code serves your purpose, but I strongly recommend to remove the words Average in Fruits column because it is misleading. It implies that the values in Mean column is not the mean of values of each group, but instead the mean of the average of the values of each group.

  • Related