Home > Software design >  dplyr (R) - Average of a column per group
dplyr (R) - Average of a column per group

Time:06-16

I have an R dataframe with a column titled 'Normalized Month' for different entities. I am trying to evaluate average of the column 'Oil_Prod' grouped by 'Normalized Month'. Hence, the final dataframe should have just two columns; 'Normalized Month' and associated mean of oil prod for that 'Normalized Month' value.

Here is the code:

mydf2 <- mydf %>% filter(`Normalized Month` > 0)

mydf2$`Normalized Month` <- factor(mydf2$`Normalized Month`)
levels(mydf2$`Normalized Month`)

  GetAvg <- mydf2 %>% 
  group_by(mydf2$`Normalized Month`)%>% 
  arrange(mydf2, desc('Normalized Month'))%>%
  dplyr::summarize(Mean = mean(mydf2$Oil_Prod, na.rm=TRUE))

Instead of getting mean for each Normalized Month, I am getting a single value of mean.

What I need is a dataframe with a below structure:

'Normalized Month' | 'Mean' |

1 value1

2 value2

3 value 3

dput:

structure(list(Api_No14 = c("42227412150000", "42227412150000", 
"42227412150000", "42227412150000", "42227412150000", "42227412040000", 
"42227412040000", "42227412040000", "42227412040000", "42227412040000", 
"42227412190000", "42227412190000", "42227412190000", "42227412190000", 
"42227412190000", "42227412130000", "42227412130000", "42227412130000", 
"42227412130000", "42227412130000"), Prod_Inj_Date = structure(c(1643587200, 
1646006400, 1648684800, 1651276800, 1653955200, 1643587200, 1646006400, 
1648684800, 1651276800, 1653955200, 1643587200, 1646006400, 1648684800, 
1651276800, 1653955200, 1643587200, 1646006400, 1648684800, 1651276800, 
1653955200), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    `Normalized Month` = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 
    2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L), .Label = c("1", 
    "2", "3", "4", "5"), class = "factor"), Oil_Prod = c(4612.588, 
    25118.511, 39470.742, 22936.157, 17928.723, 2997.715, 16893.884, 
    26115.048, 18989.759, 16545.852, 4624.317, 18241.738, 30150.569, 
    21997.824, 18308.631, 52.696, 9375.176, 26108.609, 16889.377, 
    14688.064), `CD Oil Rate` = c(148.793161290323, 897.089678571429, 
    1273.24974193548, 764.538566666667, 578.345903225807, 96.7004838709677, 
    603.353, 842.420903225806, 632.991966666667, 533.737161290323, 
    149.171516129032, 651.490642857143, 972.599, 733.2608, 590.601, 
    1.69987096774194, 334.827714285714, 842.213193548387, 562.979233333333, 
    473.808516129032)), row.names = c(NA, -20L), SBDFImportTypes = c(Api_No14 = "String", 
Prod_Inj_Date = "DateTime", `Normalized Month` = "Integer", Oil_Prod = "Real", 
`CD Oil Rate` = "Real"), class = "data.frame")

CodePudding user response:

When you use mydf2$Oil_Prod you are referencing the external data frame, not the one from the pipe. I think if you stop subsetting within your dplyr functions, things will work as expected:

# assigning your dput to mydf
mydf %>% group_by(`Normalized Month`) %>% summarise(mean = mean(Oil_Prod, na.rm = TRUE))

# output
  `Normalized Month`   mean
  <fct>               <dbl>
1 1                   3072.
2 2                  17407.
3 3                  30461.
4 4                  20203.
5 5                  16868.
  • Related