Home > Net >  R - Conditional mean based on two sets of values from a single other column
R - Conditional mean based on two sets of values from a single other column

Time:10-28

I fully admit it was not my best question Title.

I am trying to create a field that is a mean.

I have data below with an ID column that ranges from 1-5. The mean of any single value in the ID column is derived from values in the Value column for the single ID value and the two immediate prior values.

For ID 5, the mean would be all values in the Value column with an ID of 5, 4, and 3.

I am looking for something concise that calculates a mean for a single value in the ID column, based on a named range of values in the ID column, and populates the result only in the single value.

The only way I've found to do it so far is to filter the main dataset for the specific named range, get a mean for all rows of the filtered dataset, filter again to keep only the single ID value, then merge it back into the main dataset. That's long and messy and I have a lot of data.

Code is below to create an initial dataframe, then each step to get to the end result of df6.

Solutions with tidyverse would be vastly preferrable.

My initial thought was to use ifelse or case_when, but I quickly found that isolating specific values in the ID column and taking the mean of another column gets me the mean of everything in that other column and not the mean of the values specified by case or ifelse. I had thought about rowwise, which(), or a group_by that specifies values within a column, but could not get those to work.

df1 <- data.frame(
  `ID` = as.character(sample(1:5, 20, replace = TRUE)),
  Value = sample(1:50, 20, replace = FALSE))

df1 %>%
  mutate(MeanAll = mean(Value)) %>%
  arrange(ID) -> df2

df1 %>%
  filter(ID %in% c('5', '4', '3')) %>%
  mutate(MeanID5 = mean(Value)) %>%
  filter(ID == '5') %>%
  distinct(ID, MeanID5) -> df3

df1 %>%
  filter(ID %in% c('4', '3', '2')) %>%
  mutate(MeanID4 = mean(Value)) %>%
  filter(ID == '4') %>%
  distinct(ID, MeanID4)-> df4

df1 %>%
  filter(ID %in% c('3', '2','1')) %>%
  mutate(MeanID3 = mean(Value)) %>%
  filter(ID == '3') %>%
  distinct(ID, MeanID3)-> df5

df1 %>%
  left_join(df2, by = c('ID', 'Value')) %>%
  left_join(df3, by = c('ID')) %>%
  left_join(df4, by = c('ID')) %>%
  left_join(df5, by = c('ID')) %>%
  arrange(ID, Value) -> df6

CodePudding user response:

Here's a refactor using tidyverse functions:

library(tidyverse)

set.seed(1)

mean_func <- function(data, values) {
  
  max <- max(as.numeric(values))
  max_colname <- paste0("Mean", max)
  
  data %>% 
    filter(ID %in% values) %>% 
    mutate("{paste0('MeanID', max)}" := mean(Value)) %>% 
    filter(ID == max(values)) %>% 
    select(-Value) %>% 
    distinct()
}

df1 <- data.frame(
  `ID` = as.character(sample(1:5, 20, replace = TRUE)),
  Value = sample(1:50, 20, replace = FALSE))

df2 <- df1 %>%
  mutate(MeanAll = mean(Value)) %>%
  arrange(ID) 


vals <- list(c('5', '4', '3'), c('4', '3', '2'), c('3', '2', '1'))

df1 %>%
  left_join(df2, by = c('ID', 'Value')) %>% 
  left_join(map_df(vals, ~mean_func(df1, .))) %>% 
  arrange(ID)
#> Joining, by = "ID"
#>    ID Value MeanAll MeanID5 MeanID4  MeanID3
#> 1   1    37    31.6      NA      NA       NA
#> 2   1    42    31.6      NA      NA       NA
#> 3   1     6    31.6      NA      NA       NA
#> 4   1    39    31.6      NA      NA       NA
#> 5   1    47    31.6      NA      NA       NA
#> 6   1    48    31.6      NA      NA       NA
#> 7   2    25    31.6      NA      NA       NA
#> 8   2    33    31.6      NA      NA       NA
#> 9   2    43    31.6      NA      NA       NA
#> 10  2    28    31.6      NA      NA       NA
#> 11  3    15    31.6      NA      NA 32.15385
#> 12  3    20    31.6      NA      NA 32.15385
#> 13  3    35    31.6      NA      NA 32.15385
#> 14  4    34    31.6      NA  29.125       NA
#> 15  5    44    31.6    28.4      NA       NA
#> 16  5    10    31.6    28.4      NA       NA
#> 17  5    38    31.6    28.4      NA       NA
#> 18  5    23    31.6    28.4      NA       NA
#> 19  5    41    31.6    28.4      NA       NA
#> 20  5    24    31.6    28.4      NA       NA

CodePudding user response:

You can use map with reduce to merge everything to one table:

map(1:3, seq, length.out = 3) %>%
  setNames(map_dbl(., last)) %>%
  imap(
    ~tibble(
      ID = .y, 
      !!paste0("MeanID", .y) := mean(pull(filter(df1, ID %in% .x), Value))
    )
  ) %>%
  reduce(left_join, .init = df1, by = "ID") %>%
  arrange(ID) %>%
  mutate(MeanAll = mean(Value))
  • Related