Home > Net >  How to subset identical rows from dataframe by condition in R?
How to subset identical rows from dataframe by condition in R?

Time:08-13

From this dataframe I want to subset rows which have the identical "model", "Country" and "year" values but different "factor"

model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <-  c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")

df <- data.frame(model,country,year,value, factor)

  model country year value factor
1     A   Italy 2016    14    NEW
2     B Germany 2016    24    OLD
3     C  Norway 2016    11    OLD
4     A   Italy 2016    34    OLD
5     A  France 2015    73    OLD
6     C Germany 2015    11    OLD
7     B  Norway 2015    33    NEW
8     A   Italy 2015    22    NEW

Expected output:

1     A   Italy 2016    22    NEW
4     A   Italy 2016    34    OLD

CodePudding user response:

What you could do is create a column which tells the number of duplicated based on the groups (model, country and year) and then filter like this:

model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <-  c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")

df <- data.frame(model,country,year,value, factor)

library(dplyr)

df %>% 
  group_by(model, country, year) %>% 
  mutate(number_dups = n()) %>% 
  filter(number_dups > 1) %>%
  select(-number_dups) %>%
  ungroup()
#> # A tibble: 2 × 5
#>   model country  year value factor
#>   <chr> <chr>   <dbl> <dbl> <chr> 
#> 1 A     Italy    2016    14 NEW   
#> 2 A     Italy    2016    34 OLD

Created on 2022-08-12 by the reprex package (v2.0.1)

Please note: I assume that the expected output should be 14 instead of 22.

CodePudding user response:

You can group by all columns, except for factor, then keep only the rows where there is more than 1 observation that is the same. Then, to ensure we only have those with NEW and OLD, we can add factor to the group, and keep only the first row of each.

library(dplyr)
    
df %>% 
  group_by(across(c(model,country,year))) %>% 
  filter(n() > 1) %>% 
  group_by(factor, .add = T) %>% 
  slice(1)

Output

  model country  year value factor
  <chr> <chr>   <dbl> <dbl> <chr> 
1 A     Italy    2016    34 NEW   
2 A     Italy    2016    34 OLD   
  • Related