I have a dataframe with some duplicate measurements. I'd like to extract those duplicate measurements to df2
, then calculate their average value, remove the duplicates from the original df
, and insert the calculated averages back into the original df.
library(dplyr)
df <- data.frame(id = c('01_01','01_01','01_02','01_03','01_04','01_04'),
variable_1 = c('a','a','a','a','a','a'),
value = c(0.5,0.7,0.8,0.2,1.2,2.2),
variable_2 = c('cat','cat','pig','dog','chicken','chicken'),
place = c('Australia','Australia','France','Belguim','Germany','Germany'))
df2 <- df %>%
group_by(id,variable_1,place) %>%
filter(n()==2) %>%
dplyr::summarise(value = mean(value))
df <- df %>%
group_by(id,variable_1,place) %>%
filter(n()==1) %>%
rbind(df2)
After pulling out the duplicates and calculating their means using summarise, not all of the columns in df2
are retained. This means it's not possible to join the dataframes back up without getting NAs for variable_2
. Is there another way to summarise whilst retaining the structure of the original dataframe? I could obviously add variable_2
to the group_by function, but in my actual dataset there are dozens of columns, which would be unweildy.
id variable_1 value variable_2 place
<chr> <chr> <dbl> <chr> <chr>
1 01_02 a 0.8 pig France
2 01_03 a 0.2 dog Belguim
3 01_01 a 0.6 NA Australia
4 01_04 a 1.7 NA Germany
CodePudding user response:
If you are just wanting to get the mean of any group that has more than 1 row, then you don't need to separate out, since nothing will happen to just one row in a group. Here, I add max
for variable_2
, so that it only returns one value and so it is retained in the output.
library(tidyverse)
df %>%
group_by(id,variable_1,place) %>%
dplyr::summarise(value = mean(value), variable_2 = max(variable_2))
Output
id variable_1 place value variable_2
<chr> <chr> <chr> <dbl> <chr>
1 01_01 a Australia 0.6 cat
2 01_02 a France 0.8 pig
3 01_03 a Belguim 0.2 dog
4 01_04 a Germany 1.7 chicken
Or if you do want to have it broken up, then you can just add an additional summary for variable_2
, so that it doesn't get removed.
df2 <- df %>%
group_by(id,variable_1,place) %>%
filter(n()==2) %>%
dplyr::summarise(value = mean(value), variable_2 = max(variable_2))
df <- df %>%
group_by(id,variable_1,place) %>%
filter(n()==1) %>%
bind_rows(., df2)