I have a data frame like the one below:

year district
2017 arrah
2017 buxar
2017 rohtas
2018 rohtas
2018 arwal
2018 seohar
2019 nawda
2019 buxar
2019 jamui

I want to subset the data so that a district which already appeared in an earlier year does not appear again in 2018 or 2019, as shown below:

year district
2017 arrah
2017 buxar
2017 rohtas
2018 arwal
2018 seohar
2019 nawda
2019 jamui

I have tried the anti_join function, but it does not solve my problem.

CodePudding user response：

dplyr

library(dplyr)
quux %>%
  group_by(district) %>%
  slice_min(year) %>%
  ungroup()
# # A tibble: 7 x 2
#    year district
#   <int> <chr>   
# 1  2017 arrah   
# 2  2018 arwal   
# 3  2017 buxar   
# 4  2019 jamui   
# 5  2019 nawda   
# 6  2017 rohtas  
# 7  2018 seohar

base R

quux[ave(quux$year, quux$district, FUN = function(y) y == min(y)) > 0, ]
#   year district
# 1 2017    arrah
# 2 2017    buxar
# 3 2017   rohtas
# 5 2018    arwal
# 6 2018   seohar
# 7 2019    nawda
# 9 2019    jamui

Data

quux <- structure(list(year = c(2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L), district = c("arrah", "buxar", "rohtas", "rohtas", "arwal", "seohar", "nawda", "buxar", "jamui")), class = "data.frame", row.names = c(NA, -9L))
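If the goal is simply "keep the first year in which each district shows up", a shorter base R sketch is to sort by year and drop later repeats with duplicated(). This is an alternative to the ave() call above, not a replacement for it; the names sorted and first_seen are made up for illustration, and the data is rebuilt here so the snippet runs on its own.

```r
# Same data as the question, rebuilt so the snippet stands alone
quux <- data.frame(
  year = c(2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2019L, 2019L, 2019L),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)

# Sort by year so the earliest row of each district comes first
sorted <- quux[order(quux$year), ]

# duplicated() flags every occurrence after the first; drop those rows
first_seen <- sorted[!duplicated(sorted$district), ]
first_seen
```

Unlike the ave() version, this keeps exactly one row per district even when a district has several rows in its earliest year.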

CodePudding user response：

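We can try the following data.table option (df is the data frame from the question; note that setDT() converts it to a data.table by reference):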

> library(data.table)

> setcolorder(setDT(df)[, .SD[which.min(year)], district], "year")[]
   year district
1: 2017    arrah
2: 2017    buxar
3: 2017   rohtas
4: 2018    arwal
5: 2018   seohar
6: 2019    nawda
7: 2019    jamui

CodePudding user response：

Here is an example that keeps the earliest year for every district. You can swap min() for another summary (aggregate) function if you need a different rule:

library(dplyr)
df <- data.frame(
  year = c(2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)

df %>% group_by(district) %>% 
  summarise(year = min(year))
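
One thing to keep in mind: summarise() returns only the grouping column and the summarised column, so any other columns in the data are dropped. If the real data has more columns, a sketch along the following lines keeps the whole row instead. The id column is hypothetical and added purely to show that extra columns survive; first_rows is likewise an illustrative name.

```r
library(dplyr)

# Question data plus a hypothetical extra column (id) for illustration only
df <- data.frame(
  id = 1:9,
  year = c(2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019),
  district = c("arrah", "buxar", "rohtas", "rohtas", "arwal",
               "seohar", "nawda", "buxar", "jamui")
)

first_rows <- df %>%
  arrange(year) %>%                       # earliest year first
  distinct(district, .keep_all = TRUE)    # keep the first row per district, all columns

first_rows
```

Here distinct() with .keep_all = TRUE keeps the first row it sees for each district, which after arrange(year) is the row with the earliest year.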