Home > Software engineering >  After summarize, reinsert calculated values into original dataframe dplyr
After summarize, reinsert calculated values into original dataframe dplyr

Time:04-01

I have a dataframe with some duplicate measurements. I'd like to extract those duplicate measurements to df2, then calculate their average value, remove the duplicates from the original df, and insert the calculated averages back into the original df.

library(dplyr)
df <- data.frame(id = c('01_01','01_01','01_02','01_03','01_04','01_04'),
                 variable_1 = c('a','a','a','a','a','a'),
                 value = c(0.5,0.7,0.8,0.2,1.2,2.2),
                 variable_2 = c('cat','cat','pig','dog','chicken','chicken'),
                 place = c('Australia','Australia','France','Belguim','Germany','Germany'))

df2 <- df %>% 
  group_by(id,variable_1,place) %>%
  filter(n()==2) %>%
  dplyr::summarise(value = mean(value)) 

df <- df %>%
  group_by(id,variable_1,place) %>%
  filter(n()==1) %>%
  rbind(df2)

After pulling out the duplicates and calculating their means using summarise, not all of the columns in df2 are retained. This means it's not possible to join the dataframes back up without getting NAs for variable_2. Is there another way to summarise whilst retaining the structure of the original dataframe? I could obviously add variable_2 to the group_by function, but in my actual dataset there are dozens of columns, which would be unweildy.

  id    variable_1 value variable_2 place    
  <chr> <chr>      <dbl> <chr>      <chr>    
1 01_02 a            0.8 pig        France   
2 01_03 a            0.2 dog        Belguim  
3 01_01 a            0.6 NA         Australia
4 01_04 a            1.7 NA         Germany 

CodePudding user response:

If you are just wanting to get the mean of any group that has more than 1 row, then you don't need to separate out, since nothing will happen to just one row in a group. Here, I add max for variable_2, so that it only returns one value and so it is retained in the output.

library(tidyverse)

df %>% 
  group_by(id,variable_1,place) %>% 
  dplyr::summarise(value = mean(value), variable_2 = max(variable_2))

Output

  id    variable_1 place     value variable_2
  <chr> <chr>      <chr>     <dbl> <chr>     
1 01_01 a          Australia   0.6 cat       
2 01_02 a          France      0.8 pig       
3 01_03 a          Belguim     0.2 dog       
4 01_04 a          Germany     1.7 chicken   

Or if you do want to have it broken up, then you can just add an additional summary for variable_2, so that it doesn't get removed.

df2 <- df %>% 
  group_by(id,variable_1,place) %>%
  filter(n()==2) %>%
  dplyr::summarise(value = mean(value), variable_2 = max(variable_2))

df <- df %>%
  group_by(id,variable_1,place) %>%
  filter(n()==1) %>%
  bind_rows(., df2)
  • Related